Agenda

  1. Importing Data

  2. Data Exploration and Preparation

  3. Model Training

  4. Model Evaluation

Market Basket Analysis

Market basket analysis works behind the scenes in the recommendation systems of many brick-and-mortar and online retailers. The learned association rules indicate which combinations of items are often purchased together.

In this tutorial, we will perform a market basket analysis of transactional data from a grocery store.

However, the techniques could be applied to many different types of problems, from movie recommendations, to dating sites, to finding dangerous interactions among medications.

Import Data

Our market basket analysis will utilize the purchase data collected from one month of operation at a real-world grocery store. The data contains 9,835 transactions or about 327 transactions per day.

First, load the data file groceries.csv, which is available from Canvas.

groceries <- read.csv("groceries.csv")

Explore Dataset

Transactional data is stored in a slightly different format from the one we used previously.

Most of our prior analyses utilized data in the matrix form where rows indicated example instances and columns indicated features.

Let’s first browse the data. What differences do you notice? What problems do you notice?

str(groceries)
## 'data.frame':    15295 obs. of  4 variables:
##  $ citrus.fruit       : chr  "tropical fruit" "whole milk" "pip fruit" "other vegetables" ...
##  $ semi.finished.bread: chr  "yogurt" "" "yogurt" "whole milk" ...
##  $ margarine          : chr  "coffee" "" "cream cheese" "condensed milk" ...
##  $ ready.soups        : chr  "" "" "meat spreads" "long life bakery product" ...

Transactional Data

Why not just store this as a data frame as we did in most of our analyses?

  • A conventional data structure quickly becomes too large to fit in the available memory with transactional data

  • We need a new data structure that does not treat a transaction as a set of positions to be filled (or not filled) with specific items

Create a Sparse Matrix

  • Row: Each row in the sparse matrix indicates a transaction.

  • Column: The sparse matrix has a column (that is, feature) for every item.

  • Memory: A sparse matrix does not actually store the full matrix in memory; it only stores the cells that are occupied by an item.
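To make the memory argument concrete, here is a minimal base R sketch. It does not use arules and is not meant to reflect how arules stores data internally; the object names are illustrative. It takes the three transactions that inspect(groceries[1:3]) displays in this tutorial and compares a dense item matrix with the number of cells that actually hold an item.

```r
# Three small transactions, each stored as the items it actually contains
baskets <- list(
  c("citrus fruit", "margarine", "ready soups", "semi-finished bread"),
  c("coffee", "tropical fruit", "yogurt"),
  c("whole milk")
)

# Dense representation: one cell per (transaction, item) pair
items <- sort(unique(unlist(baskets)))
dense <- t(sapply(baskets, function(b) items %in% b))
colnames(dense) <- items

dense_cells  <- length(dense)  # every cell is stored, filled or not
filled_cells <- sum(dense)     # only these cells carry information
density      <- filled_cells / dense_cells

cat("dense cells:", dense_cells,
    " filled cells:", filled_cells,
    " density:", round(density, 3), "\n")
```

Even in this tiny case two thirds of the dense matrix is empty. With 169 possible items and baskets of only a few items each, the empty share grows much larger, which is the motivation for a sparse format.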

To create a sparse matrix, first install the arules package, then load it.

#install.packages("arules")
library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
#Create a sparse matrix
groceries <- read.transactions("groceries.csv", sep = ",")

#Explore the sparse matrix
summary(groceries)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

The density value of 0.02609146 (2.6 percent) refers to the proportion of nonzero matrix cells.
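As a quick check on what the density means, the arithmetic below (plain base R, using only the numbers printed by summary(groceries) above) recovers the total number of items purchased and the average basket size. The variable names are illustrative.

```r
# Figures taken from the summary(groceries) output
n_transactions <- 9835
n_items        <- 169
density        <- 0.02609146

# Total cells in the full matrix, and how many of them are nonzero
total_cells     <- n_transactions * n_items
items_purchased <- total_cells * density

# Average number of items per transaction
avg_basket_size <- items_purchased / n_transactions

cat("total cells:", total_cells, "\n")
cat("items purchased (approx.):", round(items_purchased), "\n")
cat("average basket size:", round(avg_basket_size, 3), "\n")
```

The average basket size obtained this way agrees with the mean of 4.409 reported in the transaction length summary.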

Explore Sparse Matrix

To look at the contents of the sparse matrix, use the inspect() function in combination with vector indexing. The first three transactions can be viewed as follows.

inspect(groceries[1:3])
##     items                
## [1] {citrus fruit,       
##      margarine,          
##      ready soups,        
##      semi-finished bread}
## [2] {coffee,             
##      tropical fruit,     
##      yogurt}             
## [3] {whole milk}

Visualize Sparse Matrix

We can visualize the transaction data by plotting the sparse matrix. To do so, use the image() function.

The resulting diagram depicts a matrix with 5 rows and 169 columns, indicating the 5 transactions and 169 possible items we requested.

#Display the sparse matrix for the first five transactions
image(groceries[1:5])

Sampled Visualization

This visualization will not be as useful for extremely large transaction databases, because the cells will be too small to discern.

Still, by combining it with the sample() function, you can view the sparse matrix for a randomly sampled set of transactions.

image(sample(groceries, 100))

View Support/Item Frequency

We can view the frequency of particular items across all transactions by using the itemFrequency() function.

#To view the support level for the first three items in the grocery data:
itemFrequency(groceries[, 1:3])
## abrasive cleaner artif. sweetener   baby cosmetics 
##     0.0035587189     0.0032536858     0.0006100661
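The support of a single item is simply its count divided by the number of transactions. The base R sketch below reproduces that calculation by hand for the five most frequent items, using the counts printed by summary(groceries) earlier; it is an illustration of the definition, not a replacement for itemFrequency().

```r
# Counts of the five most frequent items, from summary(groceries)
n_transactions <- 9835
item_counts <- c("whole milk"       = 2513,
                 "other vegetables" = 1903,
                 "rolls/buns"       = 1809,
                 "soda"             = 1715,
                 "yogurt"           = 1372)

# Support = share of all transactions that contain the item
support <- item_counts / n_transactions
print(round(support, 4))
```

All five values exceed 0.1, so each of these items clears a 10 percent support threshold.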

Visualize Support/Item Frequency

To present these statistics visually, use the itemFrequencyPlot() function. As shown in the following plot, this results in a bar chart showing the eight items in the groceries data with at least 10 percent support:

itemFrequencyPlot(groceries, support = 0.1)

If you would rather limit the plot to a specific number of items, use the topN parameter of itemFrequencyPlot():

itemFrequencyPlot(groceries, topN = 20)

Training a Model

We can now work at finding associations among shopping cart items. The following table shows the syntax to create sets of rules with the apriori() function.

Select Thresholds

There can sometimes be some trial and error needed to find the support and confidence parameters that produce a reasonable number of association rules.

  • If you set these levels too high, you might find no rules or rules that are too generic to be very useful.

  • A threshold that is too low might result in an unwieldy number of rules or, worse, the operation might take a very long time or run out of memory during the learning phase.

Select Thresholds

  • Minimum support: Think about the smallest number of transactions you would need before you would consider a pattern interesting.

  • For instance, you could argue that if an item is purchased twice a day (about 60 times in a month of data), it may be an interesting pattern. Since 60 out of 9,835 is approximately 0.006, we’ll try setting the support there first.

  • We’ll start with a confidence threshold of 0.25.
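The support threshold above comes from a short calculation, sketched here in base R. The 30-day month is an assumption used for the illustration, and the variable names are hypothetical.

```r
# Turning "twice a day" into a minimum support value
n_transactions    <- 9835
days_in_month     <- 30   # assumption: one month is about 30 days
purchases_per_day <- 2

min_count   <- purchases_per_day * days_in_month  # smallest interesting count
min_support <- min_count / n_transactions         # same threshold as a proportion

cat("minimum count:", min_count, "\n")
cat("minimum support:", round(min_support, 4), "\n")
```

Working backward from a count that feels meaningful is often easier than guessing a proportion directly, and the same approach carries over to data sets of other sizes.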

Generate Rules

groceryrules <- apriori(groceries,
                        parameter = list(support = 0.006,
                                         confidence = 0.25,
                                         minlen = 2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.25    0.1    1 none FALSE            TRUE       5   0.006      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 59 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [109 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [463 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
groceryrules
## set of 463 rules

Evaluating Model

To obtain a high-level overview of the association rules, we can use summary() as follows.

summary(groceryrules)
## set of 463 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2   3   4 
## 150 297  16 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   3.000   2.711   3.000   4.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift       
##  Min.   :0.006101   Min.   :0.2500   Min.   :0.009964   Min.   :0.9932  
##  1st Qu.:0.007117   1st Qu.:0.2971   1st Qu.:0.018709   1st Qu.:1.6229  
##  Median :0.008744   Median :0.3554   Median :0.024809   Median :1.9332  
##  Mean   :0.011539   Mean   :0.3786   Mean   :0.032608   Mean   :2.0351  
##  3rd Qu.:0.012303   3rd Qu.:0.4495   3rd Qu.:0.035892   3rd Qu.:2.3565  
##  Max.   :0.074835   Max.   :0.6600   Max.   :0.255516   Max.   :3.9565  
##      count      
##  Min.   : 60.0  
##  1st Qu.: 70.0  
##  Median : 86.0  
##  Mean   :113.5  
##  3rd Qu.:121.0  
##  Max.   :736.0  
## 
## mining info:
##       data ntransactions support confidence
##  groceries          9835   0.006       0.25
##                                                                                         call
##  apriori(data = groceries, parameter = list(support = 0.006, confidence = 0.25, minlen = 2))

In our rule set, 150 rules have only two items, while 297 have three, and 16 have four.

Interpretation

We can take a look at specific rules using the inspect() function. For instance, the first three rules in the groceryrules object can be viewed as follows:

inspect(groceryrules[1:3])
##     lhs                rhs               support     confidence coverage  
## [1] {potted plants} => {whole milk}      0.006914082 0.4000000  0.01728521
## [2] {pasta}         => {whole milk}      0.006100661 0.4054054  0.01504830
## [3] {herbs}         => {root vegetables} 0.007015760 0.4312500  0.01626843
##     lift     count
## [1] 1.565460 68   
## [2] 1.586614 60   
## [3] 3.956477 69

Interpretation

Interpretation of the first rule:

  • If a customer buys potted plants, they will also buy whole milk.

  • The rule has a support of about 0.7 percent: potted plants and whole milk appear together in 0.7 percent of all transactions.

  • It is correct in 40 percent of purchases involving potted plants (confidence = 0.40).

  • The lift value tells us how much more likely a customer is to buy whole milk relative to the average customer, given that he or she bought a potted plant.
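The three measures for the first rule can be rebuilt by hand from counts, which makes the definitions explicit. The base R sketch below uses the count and coverage columns from the inspect() output and the whole milk count from summary(groceries); the variable names are illustrative.

```r
n_transactions <- 9835

# {potted plants} => {whole milk}
count_both <- 68                                  # count column of the rule
count_lhs  <- round(0.01728521 * n_transactions)  # coverage x transactions
count_rhs  <- 2513                                # whole milk, from summary(groceries)

# support(X => Y)    = count(X and Y) / N
# confidence(X => Y) = count(X and Y) / count(X)
# lift(X => Y)       = confidence(X => Y) / support(Y)
support    <- count_both / n_transactions
confidence <- count_both / count_lhs
lift       <- confidence / (count_rhs / n_transactions)

cat("support:", round(support, 6), "\n")
cat("confidence:", round(confidence, 4), "\n")
cat("lift:", round(lift, 4), "\n")
```

A lift above 1 means whole milk shows up in potted-plant baskets more often than in baskets overall; a lift near 1 would suggest the two items are bought independently of each other.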

Categorize Rules

A common approach is to take the association rules and divide them into the following three categories:

  • Actionable: provide a clear and useful insight

  • Trivial: rules so obvious that they are not worth mentioning

  • Inexplicable: unclear connections between the items

Improve Model Performance - Sorting

Depending upon the objectives of the market basket analysis, the most useful rules might be the ones with the highest support, confidence, or lift.

inspect(sort(groceryrules, by = "lift")[1:5])
##     lhs                    rhs                      support confidence   coverage     lift count
## [1] {herbs}             => {root vegetables}    0.007015760  0.4312500 0.01626843 3.956477    69
## [2] {berries}           => {whipped/sour cream} 0.009049314  0.2721713 0.03324860 3.796886    89
## [3] {other vegetables,                                                                          
##      tropical fruit,                                                                            
##      whole milk}        => {root vegetables}    0.007015760  0.4107143 0.01708185 3.768074    69
## [4] {beef,                                                                                      
##      other vegetables}  => {root vegetables}    0.007930859  0.4020619 0.01972547 3.688692    78
## [5] {other vegetables,                                                                          
##      tropical fruit}    => {pip fruit}          0.009456024  0.2634561 0.03589222 3.482649    93

Subsetting Association Rules

Suppose that given the preceding rule, the marketing team is excited about the possibilities of creating an advertisement to promote berries, which are now in season. Before finalizing the campaign, however, they ask you to investigate whether berries are often purchased with other items. To answer this question, we’ll need to find all the rules that include berries in some form.

The subset() function provides a method to search for subsets of transactions, items, or rules.

berryrules <- subset(groceryrules, items %in% "berries")
inspect(berryrules)
##     lhs          rhs                  support     confidence coverage  lift    
## [1] {berries} => {whipped/sour cream} 0.009049314 0.2721713  0.0332486 3.796886
## [2] {berries} => {yogurt}             0.010574479 0.3180428  0.0332486 2.279848
## [3] {berries} => {other vegetables}   0.010269446 0.3088685  0.0332486 1.596280
## [4] {berries} => {whole milk}         0.011794611 0.3547401  0.0332486 1.388328
##     count
## [1]  89  
## [2] 104  
## [3] 101  
## [4] 116

Saving Association Rules to a File

To share the results of your market basket analysis, you can save the rules to a CSV file with the write() function.

write(groceryrules, file = "groceryrules.csv", 
      sep = ",", quote = TRUE, row.names = FALSE)

Summary

  • Association rules are frequently used to find useful insights in the massive transaction databases of large retailers

  • Because association rule learning is unsupervised, we can extract knowledge from large databases without any prior knowledge of what patterns to seek

  • The challenge is to reduce the big data into manageable insight. We did this by setting thresholds on support and confidence, and by ranking the resulting rules by lift

Your Turn

  1. Identify all association rules that include “whole milk” and sort them by confidence.

  2. Examine the results. Does a higher confidence value always correspond to a higher lift?

  3. Under what conditions does a higher confidence lead to a higher lift? Explain why by referring to the formulas for confidence and lift.

NBA Case - How to Turn Regular Data into Transactional Data

library(tidyverse)
NBA<-read.csv("NBA.csv")
head(NBA)
##           Player  Salary Pos Age  Tm  G   MP  P3  P2  FT TRB AST STL BLK TOV
## 1     Saddiq Bey 2959080  SF  23 ATL  4 24.8 2.5 0.8 0.8 4.5 0.5 0.5 0.3 1.0
## 2 Jarrett Culver  260295  SG  23 ATL 10 13.7 0.1 1.6 0.9 3.8 0.6 0.6 0.2 0.7
## 3  Trent Forrest  508891  PG  24 ATL 18 13.3 0.0 1.3 0.1 1.6 1.7 0.3 0.1 0.6
## 4     AJ Griffin 3536160  SF  19 ATL 57 20.2 1.5 2.2 0.5 2.1 1.0 0.7 0.1 0.6
## 5  Aaron Holiday 1968175  PG  26 ATL 51 14.3 0.6 0.9 0.5 1.3 1.3 0.5 0.2 0.6
## 6  Jalen Johnson 2792640  SF  21 ATL 57 14.3 0.4 1.6 0.7 3.8 0.9 0.5 0.5 0.5
##    PF PTS
## 1 1.3 9.8
## 2 1.4 4.4
## 3 0.9 2.8
## 4 1.2 9.2
## 5 1.3 4.1
## 6 1.4 5.2

Turn numeric variables into categories

For example, if a player’s salary is above the 75th percentile, he is categorized as high salary.

If a player’s salary is below the 25th percentile, he is categorized as low salary. Values between the two percentiles are left as missing (NA).

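# For every numeric column, add a companion column (original name plus ".")
# labeled "Low" below the 25th percentile and "High" above the 75th;
# values in between become NA
NBA <- NBA %>%
  mutate(
    across(
      where(is.numeric),
      ~ case_when(
        . < quantile(., 0.25, na.rm = TRUE) ~ "Low",
        . > quantile(., 0.75, na.rm = TRUE) ~ "High",
        TRUE ~ NA_character_
      ),
      .names = "{.col}."
    )
  )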
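The binning rule can be seen in isolation on a toy vector. The sketch below uses base R's ifelse() rather than dplyr so that it runs on its own; the vector x is made-up illustration data, not part of the NBA data set.

```r
# Toy data: nine evenly spaced values
x <- 1:9

# Quartile cut points (unname() drops the "25%" / "75%" labels)
low_cut  <- unname(quantile(x, 0.25, na.rm = TRUE))
high_cut <- unname(quantile(x, 0.75, na.rm = TRUE))

# Below the first quartile -> "Low", above the third -> "High", otherwise NA
category <- ifelse(x < low_cut, "Low",
                   ifelse(x > high_cut, "High", NA_character_))

print(data.frame(x, category))
```

Because the comparisons are strict, values equal to a cut point fall into the middle group and become NA, as do all values between the two quartiles.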

Take all categorical variables

NBAasso <- NBA[18:31] # The new variables are in columns 18 through 31

Convert the data into transactional format

NBAasso[] <- lapply(NBAasso, as.factor)
library(arules)
trans <- as(NBAasso, "transactions") 
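To build intuition for what the conversion produces, the base R sketch below imitates the "variable=level" item labels that appear in the summary output that follows, and skips missing values so they contribute no item. It is only an approximation for illustration, under the assumption that NA cells yield no item; it does not show how arules performs the coercion, and the players data frame is made up.

```r
# Made-up factor data with the same naming style as the NBA columns
players <- data.frame(
  Salary. = factor(c("Low", NA, "High")),
  Age.    = factor(c("High", "High", NA))
)

# One basket per row: "column=level" for every non-missing cell
baskets <- lapply(seq_len(nrow(players)), function(i) {
  values <- vapply(players[i, ], as.character, character(1))
  keep   <- !is.na(values)
  if (!any(keep)) return(character(0))  # a row of all NAs gives an empty basket
  paste0(names(values)[keep], "=", values[keep])
})

print(baskets)
```

This also suggests why the transaction lengths differ from player to player: a player only receives an item for a statistic on which he falls into the top or bottom quarter.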

Overview of the dataset

summary(trans)
## transactions as itemMatrix in sparse format with
##  176 rows (elements/itemsets/transactions) and
##  28 columns (items) and a density of 0.2262581 
## 
## most frequent items:
## Salary.=Low   Age.=High      G.=Low    MP.=High     MP.=Low     (Other) 
##          44          44          44          44          44         895 
## 
## element (itemset/transaction) length distribution:
## sizes
##  0  1  2  3  4  5  6  7  8  9 10 11 12 13 
##  2  5 15 22 16 16 16 18 16 14 15 11  7  3 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   3.750   6.000   6.335   9.000  13.000 
## 
## includes extended item information - examples:
##         labels variables levels
## 1 Salary.=High   Salary.   High
## 2  Salary.=Low   Salary.    Low
## 3    Age.=High      Age.   High
## 
## includes extended transaction information - examples:
##   transactionID
## 1             1
## 2             2
## 3             3
inspect(trans[1:10])
##      items          transactionID
## [1]  {G.=Low,                    
##       MP.=High,                  
##       P3.=High,                  
##       P2.=Low,                   
##       TRB.=High,                 
##       AST.=Low,                  
##       PTS.=High}               1 
## [2]  {Salary.=Low,               
##       G.=Low,                    
##       P3.=Low,                   
##       AST.=Low}                2 
## [3]  {Salary.=Low,               
##       G.=Low,                    
##       P3.=Low,                   
##       FT.=Low,                   
##       TRB.=Low,                  
##       PF.=Low,                   
##       PTS.=Low}                3 
## [4]  {Age.=Low,                  
##       G.=High,                   
##       P3.=High}                4 
## [5]  {TRB.=Low}                5 
## [6]  {Age.=Low,                  
##       G.=High}                 6 
## [7]  {Age.=High,                 
##       MP.=Low,                   
##       P2.=Low,                   
##       FT.=Low,                   
##       TRB.=Low,                  
##       STL.=Low,                  
##       BLK.=Low,                  
##       TOV.=Low,                  
##       PF.=Low,                   
##       PTS.=Low}                7 
## [8]  {Salary.=Low,               
##       G.=Low,                    
##       MP.=Low,                   
##       P2.=Low,                   
##       FT.=Low,                   
##       TRB.=Low,                  
##       AST.=Low,                  
##       STL.=Low,                  
##       BLK.=Low,                  
##       TOV.=Low,                  
##       PF.=Low,                   
##       PTS.=Low}                8 
## [9]  {Salary.=Low,               
##       G.=Low,                    
##       MP.=Low,                   
##       P3.=Low,                   
##       P2.=Low,                   
##       FT.=Low,                   
##       TRB.=Low,                  
##       AST.=Low,                  
##       STL.=Low,                  
##       BLK.=Low,                  
##       TOV.=Low,                  
##       PF.=Low,                   
##       PTS.=Low}                9 
## [10] {Age.=High}               10

Train the Model

NBArules <- apriori(trans,
                    parameter = list(support = 0.05,
                                     confidence = 0.1,
                                     minlen = 2,
                                     maxlen = 3))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.1    0.1    1 none FALSE            TRUE       5    0.05      2
##  maxlen target  ext
##       3  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 8 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[28 item(s), 176 transaction(s)] done [0.00s].
## sorting and recoding items ... [28 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3
## Warning in apriori(trans, parameter = list(support = 0.05, confidence = 0.1, :
## Mining stopped (maxlen reached). Only patterns up to a length of 3 returned!
##  done [0.00s].
## writing ... [1684 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
summary(NBArules)
## set of 1684 rules
## 
## rule length distribution (lhs + rhs):sizes
##    2    3 
##  358 1326 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   3.000   2.787   3.000   3.000 
## 
## summary of quality measures:
##     support          confidence        coverage            lift       
##  Min.   :0.05114   Min.   :0.2045   Min.   :0.05114   Min.   :0.8182  
##  1st Qu.:0.06250   1st Qu.:0.5116   1st Qu.:0.10795   1st Qu.:2.2345  
##  Median :0.07955   Median :0.6250   Median :0.13068   Median :2.7273  
##  Mean   :0.08530   Mean   :0.6231   Mean   :0.14464   Mean   :2.6962  
##  3rd Qu.:0.10227   3rd Qu.:0.7407   3rd Qu.:0.16477   3rd Qu.:3.1590  
##  Max.   :0.19318   Max.   :1.0000   Max.   :0.25000   Max.   :4.4000  
##      count      
##  Min.   : 9.00  
##  1st Qu.:11.00  
##  Median :14.00  
##  Mean   :15.01  
##  3rd Qu.:18.00  
##  Max.   :34.00  
## 
## mining info:
##   data ntransactions support confidence
##  trans           176    0.05        0.1
##                                                                                               call
##  apriori(data = trans, parameter = list(support = 0.05, confidence = 0.1, minlen = 2, maxlen = 3))

Explore further associations

About Salary

Salaryrules <- subset(NBArules, rhs %in% "Salary.=High" & size(rhs) == 1)
inspect(Salaryrules)
inspect(sort(Salaryrules, by = "lift")[1:10])

About Age

Agerules <- subset(NBArules, lhs %in% "Age.=High" & size(lhs) == 1)
inspect(Agerules)
inspect(sort(Agerules, by = "lift")[1:10])

Your Turn

Examine the variables Salary and Age, highlight one notable finding for each, and discuss the corresponding actionable insight.