Business Intelligence

Agenda

Importing Data
Data Exploration and Preparation
Model Training
Model Evaluation

Market Basket Analysis

Market basket analysis works behind the scenes in the recommendation systems of many brick-and-mortar and online retailers. The learned association rules indicate which combinations of items are often purchased together.

In this tutorial, we will perform a market basket analysis of transactional data from a grocery store.

However, the techniques could be applied to many different types of problems, from movie recommendations, to dating sites, to finding dangerous interactions among medications.

Import Data

Our market basket analysis will utilize the purchase data collected from one month of operation at a real-world grocery store. The data contains 9,835 transactions or about 327 transactions per day.

First, load the data file groceries.csv, available on Canvas under Module Week 8 or through the groceries.csv link.

groceries <- read.csv("groceries.csv")

Explore Dataset

Transactional data is stored in a slightly different format from the one we used previously.

Most of our prior analyses utilized data in the matrix form where rows indicated example instances and columns indicated features.

Let’s first browse the data. What differences do you notice? What problems do you notice?

str(groceries)

## 'data.frame':    15295 obs. of  4 variables:
##  $ citrus.fruit       : chr  "tropical fruit" "whole milk" "pip fruit" "other vegetables" ...
##  $ semi.finished.bread: chr  "yogurt" "" "yogurt" "whole milk" ...
##  $ margarine          : chr  "coffee" "" "cream cheese" "condensed milk" ...
##  $ ready.soups        : chr  "" "" "meat spreads" "long life bakery product" ...
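The str() output hints at the problem: read.csv() used the first transaction as column names and forced every line into four columns, while each line of the file is really a basket of varying length. A minimal base R sketch of reading such lines as item sets follows. The three sample lines and the names `raw_lines`, `baskets`, and `basket_sizes` are illustrative assumptions, not part of the original analysis.

```r
# Three sample lines in the same shape as groceries.csv (one basket per line).
raw_lines <- c(
  "citrus fruit,semi-finished bread,margarine,ready soups",
  "tropical fruit,yogurt,coffee",
  "whole milk"
)

# Split each line on commas so every basket becomes its own character vector.
baskets <- strsplit(raw_lines, ",", fixed = TRUE)

# Baskets have different lengths, which a fixed-column data frame cannot express.
basket_sizes <- lengths(baskets)
print(basket_sizes)
```

Keeping each basket as its own vector avoids both the lost first transaction and the empty padding cells.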

Transactional Data

Why not just store this as a data frame as we did in most of our analyses?

A conventional data structure quickly becomes too large to fit in the available memory with transactional data.
We need a new data structure that does not treat a transaction as a set of positions to be filled (or not filled) with specific items.

Create a Sparse Matrix

Row: Each row in the sparse matrix indicates a transaction.
Column: The sparse matrix has a column (that is, feature) for every item.
Memory: A sparse matrix does not actually store the full matrix in memory; it only stores the cells that are occupied by an item.
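To make the memory point concrete, here is a small base R sketch that contrasts a dense item matrix with a list of occupied positions only. The toy baskets and all variable names are assumptions for illustration; arules uses its own internal sparse format rather than this exact layout.

```r
# Toy baskets (illustrative data, not taken from groceries.csv).
baskets <- list(
  c("bread", "milk"),
  c("milk"),
  c("bread", "butter", "milk")
)
items <- sort(unique(unlist(baskets)))

# Dense representation: one logical cell per transaction/item pair.
dense <- t(vapply(baskets, function(b) items %in% b, logical(length(items))))
colnames(dense) <- items

# Sparse representation: keep only the (row, column) positions that are TRUE.
occupied <- which(dense, arr.ind = TRUE)

n_dense_cells  <- length(dense)    # every cell is stored
n_sparse_cells <- nrow(occupied)   # only occupied cells are stored
print(c(dense = n_dense_cells, sparse = n_sparse_cells))
```

With only three items the saving is small, but with 169 items and baskets that hold a handful of them, most dense cells would be empty.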

To create a sparse matrix, first install the arules package, then load it.

#install.packages("arules")
library(arules)

## Loading required package: Matrix

## 
## Attaching package: 'arules'

## The following objects are masked from 'package:base':
## 
##     abbreviate, write

#Create a sparse matrix
groceries <- read.transactions("groceries.csv", sep = ",")

#Explore the sparse matrix
summary(groceries)

## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

The density value of 0.02609146 (2.6 percent) refers to the proportion of nonzero matrix cells.
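The density can be tied back to the other numbers in the summary with a little arithmetic. The sketch below uses only figures printed above; the variable names are illustrative.

```r
# Figures taken from summary(groceries) above.
n_transactions <- 9835
n_items        <- 169
density        <- 0.02609146

# Total cells in the full matrix, and how many of them are occupied.
total_cells     <- n_transactions * n_items
items_purchased <- round(total_cells * density)

# Occupied cells per transaction should match the reported mean basket size.
mean_basket_size <- items_purchased / n_transactions
print(c(total_cells, items_purchased, mean_basket_size))
```

The occupied-cell count equals the sum of the "most frequent items" counts (including Other), and dividing it by the number of transactions recovers the mean of about 4.409 items per basket.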

Explore Sparse Matrix

To look at the contents of the sparse matrix, use the inspect() function in combination with the vector operators. The first three transactions can be viewed as follows.

inspect(groceries[1:3])

##     items                
## [1] {citrus fruit,       
##      margarine,          
##      ready soups,        
##      semi-finished bread}
## [2] {coffee,             
##      tropical fruit,     
##      yogurt}             
## [3] {whole milk}

Visualize Sparse Matrix

We can visualize the transaction data by plotting the sparse matrix. To do so, use the image() function.

The resulting diagram depicts a matrix with 5 rows and 169 columns, indicating the 5 transactions and 169 possible items we requested.

#Display the sparse matrix for the first five transactions
image(groceries[1:5])

Sampled Visualization

This visualization will not be as useful for extremely large transaction databases, because the cells will be too small to discern.

Still, by combining it with the sample() function, you can view the sparse matrix for a randomly sampled set of transactions.

image(sample(groceries, 100))

View Support/Item Frequency

We can view the proportion of transactions that contain a given item (its support) by using the itemFrequency() function.

#To view the support level for the first three items in the grocery data:
itemFrequency(groceries[, 1:3])

## abrasive cleaner artif. sweetener   baby cosmetics 
##     0.0035587189     0.0032536858     0.0006100661
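Each support value is simply the item's transaction count divided by the total number of transactions. A quick sketch that converts the proportions above back into counts (variable names are illustrative):

```r
# Support values reported by itemFrequency() above.
n_transactions <- 9835
support <- c("abrasive cleaner" = 0.0035587189,
             "artif. sweetener" = 0.0032536858,
             "baby cosmetics"   = 0.0006100661)

# Support = count / number of transactions, so count = support * transactions.
item_counts <- round(support * n_transactions)
print(item_counts)
```

Thinking in counts makes it easier to judge whether an item appears often enough to matter.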

Visualize Support/Item Frequency

To present these statistics visually, use the itemFrequencyPlot() function. As shown in the following plot, this results in a bar chart showing the eight items in the groceries data with at least 10 percent support:

itemFrequencyPlot(groceries, support = 0.1)

If you would rather limit the plot to a specific number of items, use the topN parameter of itemFrequencyPlot():

itemFrequencyPlot(groceries, topN = 20)

Training a Model

We can now work at finding associations among shopping cart items. The following table shows the syntax to create sets of rules with the apriori() function:

Select Thresholds

Finding support and confidence parameters that produce a reasonable number of association rules can take some trial and error.

If you set these levels too high, you might find no rules or rules that are too generic to be very useful.
A threshold that is too low might produce an unwieldy number of rules or, worse, cause the learning phase to take a very long time or run out of memory.
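The effect of the support threshold can be seen on a toy example. The sketch below counts item pairs by brute force in base R; it is not the Apriori algorithm, and the baskets, thresholds, and variable names are assumptions chosen for illustration.

```r
# Five toy baskets (illustrative data).
baskets <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("butter", "milk"),
  c("bread", "milk"),
  c("eggs")
)

# Every pair of distinct items that appears in the data.
items <- sort(unique(unlist(baskets)))
pairs <- combn(items, 2, simplify = FALSE)

# Support of a pair = share of baskets that contain both items.
pair_support <- vapply(
  pairs,
  function(p) mean(vapply(baskets, function(b) all(p %in% b), logical(1))),
  numeric(1)
)
names(pair_support) <- vapply(pairs, paste, character(1), collapse = " + ")

# Count how many pairs survive a strict and a lenient threshold.
n_strict  <- sum(pair_support >= 0.5)
n_lenient <- sum(pair_support >= 0.1)
print(pair_support)
```

Lowering the threshold lets more pairs through, and on real data with 169 items the number of candidate combinations grows much faster than in this toy case.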

Select Thresholds

Minimum support: Think about the smallest number of transactions you would need before you would consider a pattern interesting.
For instance, you could argue that if an item is purchased twice a day (about 60 times in a month of data), it may be an interesting pattern. Since 60 out of 9,835 is approximately 0.006, we’ll try setting the support there first.
We’ll start with a confidence threshold of 0.25, which means that in order to be included in the results, the rule has to be correct at least 25 percent of the time.
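The support threshold arithmetic can be written out as a short sketch. The 30-day month is an assumption matching the "about 60 times" estimate, and the variable names are illustrative.

```r
# Translate "twice a day for a month" into a support proportion.
purchases_per_day <- 2
days_in_month     <- 30      # assumed month length
n_transactions    <- 9835

min_count   <- purchases_per_day * days_in_month   # number of transactions
min_support <- min_count / n_transactions          # proportion of all transactions

# Round to three decimals to get the value passed to apriori().
support_threshold <- round(min_support, 3)
print(c(min_count, min_support, support_threshold))
```

Starting from a count you can defend and converting it to a proportion is usually easier than guessing a proportion directly.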

Generate Rules

groceryrules <- apriori(groceries, 
                        parameter = list(support = 0.006,
                                         confidence = 0.25,
                                         minlen = 2))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.25    0.1    1 none FALSE            TRUE       5   0.006      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 59 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [109 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [463 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

groceryrules

## set of 463 rules

Evaluating Model

To obtain a high-level overview of the association rules, we can use summary() as follows.

summary(groceryrules)

## set of 463 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2   3   4 
## 150 297  16 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   3.000   2.711   3.000   4.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift       
##  Min.   :0.006101   Min.   :0.2500   Min.   :0.009964   Min.   :0.9932  
##  1st Qu.:0.007117   1st Qu.:0.2971   1st Qu.:0.018709   1st Qu.:1.6229  
##  Median :0.008744   Median :0.3554   Median :0.024809   Median :1.9332  
##  Mean   :0.011539   Mean   :0.3786   Mean   :0.032608   Mean   :2.0351  
##  3rd Qu.:0.012303   3rd Qu.:0.4495   3rd Qu.:0.035892   3rd Qu.:2.3565  
##  Max.   :0.074835   Max.   :0.6600   Max.   :0.255516   Max.   :3.9565  
##      count      
##  Min.   : 60.0  
##  1st Qu.: 70.0  
##  Median : 86.0  
##  Mean   :113.5  
##  3rd Qu.:121.0  
##  Max.   :736.0  
## 
## mining info:
##       data ntransactions support confidence
##  groceries          9835   0.006       0.25
##                                                                                         call
##  apriori(data = groceries, parameter = list(support = 0.006, confidence = 0.25, minlen = 2))

In our rule set, 150 rules have only two items, while 297 have three, and 16 have four.

Interpretation

We can take a look at specific rules using the inspect() function. For instance, the first three rules in the groceryrules object can be viewed as follows:

inspect(groceryrules[1:3])

##     lhs                rhs               support     confidence coverage  
## [1] {potted plants} => {whole milk}      0.006914082 0.4000000  0.01728521
## [2] {pasta}         => {whole milk}      0.006100661 0.4054054  0.01504830
## [3] {herbs}         => {root vegetables} 0.007015760 0.4312500  0.01626843
##     lift     count
## [1] 1.565460 68   
## [2] 1.586614 60   
## [3] 3.956477 69

Interpretation

Interpretation of the first rule:

If a customer buys potted plants, they are likely to also buy whole milk.
This rule covers about 0.7 percent of the transactions (support of 0.0069).
It is correct in 40 percent of purchases involving potted plants (confidence of 0.40).
The lift value tells us how much more likely a customer is to buy whole milk relative to the average customer, given that he or she bought a potted plant. Here, a lift of about 1.57 means such a customer is roughly 57 percent more likely to buy whole milk.
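The three measures for the first rule can be reproduced from numbers already printed in this tutorial. The sketch below uses the standard definitions (confidence = rule support divided by the support of the left-hand side, lift = confidence divided by the support of the right-hand side); the variable names are illustrative.

```r
# Values taken from inspect(groceryrules[1:3]) and summary(groceries).
n_transactions <- 9835
count_rule     <- 68          # baskets with potted plants and whole milk
coverage       <- 0.01728521  # share of baskets with potted plants (lhs)
count_rhs      <- 2513        # baskets with whole milk (rhs)

# Support: share of all baskets that contain both sides of the rule.
support <- count_rule / n_transactions

# Confidence: of the baskets with the lhs, the share that also has the rhs.
confidence <- support / coverage

# Lift: confidence relative to how often the rhs is bought overall.
lift <- confidence / (count_rhs / n_transactions)
print(c(support = support, confidence = confidence, lift = lift))
```

A lift above 1 means the two sides occur together more often than would be expected if they were bought independently.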

Categorize Rules

A common approach is to take the association rules and divide them into the following three categories:

Actionable: provide a clear and useful insight
Trivial: rules so obvious that they are not worth mentioning
Inexplicable: unclear connections between the items

Improve Model Performance - Sorting

Depending upon the objectives of the market basket analysis, the most useful rules might be the ones with the highest support, confidence, or lift.

inspect(sort(groceryrules, by = "lift")[1:5])

##     lhs                    rhs                      support confidence   coverage     lift count
## [1] {herbs}             => {root vegetables}    0.007015760  0.4312500 0.01626843 3.956477    69
## [2] {berries}           => {whipped/sour cream} 0.009049314  0.2721713 0.03324860 3.796886    89
## [3] {other vegetables,                                                                          
##      tropical fruit,                                                                            
##      whole milk}        => {root vegetables}    0.007015760  0.4107143 0.01708185 3.768074    69
## [4] {beef,                                                                                      
##      other vegetables}  => {root vegetables}    0.007930859  0.4020619 0.01972547 3.688692    78
## [5] {other vegetables,                                                                          
##      tropical fruit}    => {pip fruit}          0.009456024  0.2634561 0.03589222 3.482649    93
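The same idea can be sketched in base R on a plain data frame, which shows how the choice of measure changes the ranking. The data frame below copies the three rules printed by inspect(groceryrules[1:3]); it is a stand-in for illustration, not an arules rules object.

```r
# The first three rules from the earlier output, as an ordinary data frame.
rules <- data.frame(
  lhs        = c("{potted plants}", "{pasta}", "{herbs}"),
  rhs        = c("{whole milk}", "{whole milk}", "{root vegetables}"),
  support    = c(0.006914082, 0.006100661, 0.007015760),
  confidence = c(0.4000000, 0.4054054, 0.4312500),
  lift       = c(1.565460, 1.586614, 3.956477),
  stringsAsFactors = FALSE
)

# Rank the same rules by two different measures, best first.
by_lift    <- rules[order(rules$lift, decreasing = TRUE), ]
by_support <- rules[order(rules$support, decreasing = TRUE), ]
print(by_lift$lhs)
print(by_support$lhs)
```

The herbs rule leads on both measures here, but pasta and potted plants swap places depending on whether lift or support is used.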

Subsetting Association Rules

Suppose that given the preceding rule, the marketing team is excited about the possibilities of creating an advertisement to promote berries, which are now in season. Before finalizing the campaign, however, they ask you to investigate whether berries are often purchased with other items. To answer this question, we’ll need to find all the rules that include berries in some form.

The subset() function provides a method to search for subsets of transactions, items, or rules.

berryrules <- subset(groceryrules, items %in% "berries")
inspect(berryrules)

##     lhs          rhs                  support     confidence coverage  lift    
## [1] {berries} => {whipped/sour cream} 0.009049314 0.2721713  0.0332486 3.796886
## [2] {berries} => {yogurt}             0.010574479 0.3180428  0.0332486 2.279848
## [3] {berries} => {other vegetables}   0.010269446 0.3088685  0.0332486 1.596280
## [4] {berries} => {whole milk}         0.011794611 0.3547401  0.0332486 1.388328
##     count
## [1]  89  
## [2] 104  
## [3] 101  
## [4] 116

Saving Association Rules to a File

To share the results of your market basket analysis, you can save the rules to a CSV file with the write() function.

write(groceryrules, file = "groceryrules.csv", 
      sep = ",", quote = TRUE, row.names = FALSE)

Summary

Association rules are frequently used to find useful insights in the massive transaction databases of large retailers.
Because association rule learning is unsupervised, it can extract knowledge from large databases without any prior knowledge of what patterns to seek.
The challenge is to reduce big data to manageable insights. We did this by setting appropriate thresholds on the rule measures (support and confidence) and by ranking the resulting rules by measures such as lift.