Introduction

Market basket analysis is an unsupervised learning technique for analyzing transactional data, and it can be a powerful way to study the purchasing patterns of consumers. In this tutorial, we will examine the concept behind market basket analysis, introduce the apriori algorithm, and conduct our own market basket analysis using R.

Market Basket Analysis

Market basket analysis is an association rule method that identifies associations in transactional data. It is an unsupervised machine learning technique used for knowledge discovery rather than prediction. This analysis results in a set of association rules that identify patterns of relationships among items. A rule can typically be expressed in the form

\[ \begin{aligned} \{\text{peanut butter, jelly}\} \to \{\text{bread}\} \end{aligned} \]

The above rule states that if both peanut butter and jelly are purchased, then bread is also likely to be purchased.

The Apriori Algorithm

In real life, transactional data can be extremely large, both in the number of transactions and in the number of items monitored. Given \(k\) items that can either appear or not appear in a set, there are \(2^{k}\) possible item sets that must be searched for rules. Thus, even if a retailer sells only 100 distinct items, there could be \(2^{100} \approx 1.27 \times 10^{30}\) item sets to evaluate, which is an impossible task. However, a smart rule learner algorithm can take advantage of the fact that many of the potential item combinations are rarely found in practice. For example, if a retailer sells both firearms and dairy products, the set \(\{\text{gun, butter}\}\) is extremely unlikely to be common. Ignoring these rare cases makes it possible to limit the scope of the search for rules to a much more manageable size.
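To get a feel for how quickly the search space grows, the short base R sketch below evaluates \(2^{k}\) for a few catalogue sizes. The values of `k` are arbitrary illustrations, not figures from any real retailer.

```r
# Number of possible item sets (2^k, counting the empty set) for k distinct items
k <- c(10, 20, 50, 100)          # illustrative catalogue sizes
n_itemsets <- 2^k

data.frame(items = k, possible_item_sets = n_itemsets)
```

Even at 50 items the count is already above \(10^{15}\), which is why an exhaustive search is not practical.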

To resolve this issue, R. Agrawal and R. Srikant introduced the apriori algorithm. The apriori algorithm utilizes a simple prior belief (hence the name a priori) about the properties of frequent item sets: all subsets of a frequent item set must also be frequent. This makes it possible to limit the number of rules to search for. For example, the set \(\{\text{gun, butter}\}\) can only be frequent if \(\{\text{gun}\}\) and \(\{\text{butter}\}\) both occur frequently. Conversely, if either \(\{\text{gun}\}\) or \(\{\text{butter}\}\) is infrequent, then any set containing that item can be excluded from the search.

Measuring Rule Interest: Support and Confidence

There are two statistical measures that can be used to determine whether or not a rule is deemed “interesting.”

Support - This measures how frequently an item set occurs in the data. It can be calculated as

\[ \begin{aligned} \text{Support}(X)=\frac{\text{Count}(X)}{N} \end{aligned} \] where \(X\) represents an item set, \(\text{Count}(X)\) is the number of transactions containing every item in \(X\), and \(N\) represents the total number of transactions.

Confidence - This measures a rule’s predictive power or accuracy. It is calculated as the support of the item set containing both \(X\) and \(Y\) divided by the support of \(X\).

\[ \begin{aligned} \text{Confidence}(X \to Y)=\frac{\text{Support}(X,Y)}{\text{Support}(X)} \end{aligned} \] The important thing to note regarding confidence is that, in general, \(\text{Confidence}(X \to Y) \neq \text{Confidence}(Y \to X)\). To illustrate, consider the following transactional table.

Table 1: Transactions
Transaction Purchases
1 {flowers, get well card, soda}
2 {toy bear, flowers, balloons, candy}
3 {get well card, candy, flowers}
4 {toy bear, balloons, soda}
5 {flowers, get well card, soda}

\[ \begin{aligned} \text{Confidence}(\text{get well card} \to \text{flowers}) = \frac{\text{Support}(\text{get well card}, \text{flowers})}{\text{Support(get well card)}}=\frac{0.6}{0.6}=1.0 \end{aligned} \]

\[ \begin{aligned} \text{Confidence}(\text{flowers} \to \text{get well card}) = \frac{\text{Support}(\text{flowers, get well card})}{\text{Support(flowers)}}= \frac{0.6}{0.8}=0.75 \end{aligned} \]

This means that every transaction containing a get well card also contains flowers (100% of the time), while only 75% of the transactions containing flowers also contain a get well card. Rules like \(\{\text{get well card}\} \to \{\text{flowers}\}\) are considered strong rules because they have both high support and high confidence.
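The figures above can be reproduced with a few lines of base R. This is only a sketch: the helper names `support()` and `confidence()` are made up for illustration and are not functions from any package.

```r
# Table 1, written as a list with one character vector per transaction
transactions <- list(
  c("flowers", "get well card", "soda"),
  c("toy bear", "flowers", "balloons", "candy"),
  c("get well card", "candy", "flowers"),
  c("toy bear", "balloons", "soda"),
  c("flowers", "get well card", "soda")
)

# Support: share of transactions that contain every item in `items`
support <- function(items, transactions) {
  mean(vapply(transactions, function(t) all(items %in% t), logical(1)))
}

# Confidence of the rule lhs -> rhs
confidence <- function(lhs, rhs, transactions) {
  support(c(lhs, rhs), transactions) / support(lhs, transactions)
}

support("flowers", transactions)                        # 0.8
support(c("get well card", "flowers"), transactions)    # 0.6
confidence("get well card", "flowers", transactions)    # 1
confidence("flowers", "get well card", transactions)    # 0.75
```

Swapping the two arguments of `confidence()` changes the denominator, which is exactly why the two directions of a rule can have different confidence.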

How the Apriori Algorithm Works

The way in which the apriori algorithm creates rules is relatively straightforward.

  1. Identify all item sets that meet a minimum support threshold - This process occurs in multiple iterations. Each successive iteration evaluates the support of increasingly large item sets. The first iteration involves evaluating the 1-item sets. The second iteration involves evaluating the 2-item sets, and so on. The result of each iteration i is the set of i-item sets that meet the minimum threshold. All item sets from iteration i are combined in order to generate candidate item sets for evaluation in iteration i+1. The apriori principle can eliminate some of the candidates before the next iteration begins. For example, if \(\{\text{A}\}\), \(\{\text{B}\}\), and \(\{\text{C}\}\) are frequent in iteration 1, but \(\{\text{D}\}\) is not, then the second iteration will only consider the item sets \(\{\text{A, B}\}\), \(\{\text{A, C}\}\), and \(\{\text{B, C}\}\).

  2. Create rules from these item sets that meet a minimum confidence threshold.
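The two steps above can be sketched in base R on the Table 1 data. This is a teaching sketch under simplifying assumptions, not how any package implements the algorithm: the function names `frequent_itemsets()` and `rules_from()` are hypothetical, the thresholds (50% support, 70% confidence) are arbitrary, and rules are restricted to a single item on the right-hand side.

```r
transactions <- list(
  c("flowers", "get well card", "soda"),
  c("toy bear", "flowers", "balloons", "candy"),
  c("get well card", "candy", "flowers"),
  c("toy bear", "balloons", "soda"),
  c("flowers", "get well card", "soda")
)

support <- function(items, transactions) {
  mean(vapply(transactions, function(t) all(items %in% t), logical(1)))
}

# Step 1: level-wise search for item sets that meet min_support
frequent_itemsets <- function(transactions, min_support) {
  candidates <- as.list(sort(unique(unlist(transactions))))  # 1-item sets
  frequent <- list()
  while (length(candidates) > 0) {
    keep <- Filter(function(s) support(s, transactions) >= min_support,
                   candidates)
    if (length(keep) == 0) break
    frequent <- c(frequent, keep)
    k <- length(keep[[1]])
    # Join: merge pairs of frequent k-item sets into (k+1)-item candidates
    merged <- list()
    if (length(keep) > 1) {
      for (p in combn(length(keep), 2, simplify = FALSE)) {
        u <- sort(union(keep[[p[1]]], keep[[p[2]]]))
        if (length(u) == k + 1) merged[[paste(u, collapse = "|")]] <- u
      }
    }
    # Prune (apriori principle): every k-item subset must itself be frequent
    keep_keys <- vapply(keep, paste, character(1), collapse = "|")
    candidates <- Filter(function(u) {
      subsets <- combn(u, k, simplify = FALSE)
      all(vapply(subsets, function(s) paste(s, collapse = "|") %in% keep_keys,
                 logical(1)))
    }, unname(merged))
  }
  frequent
}

# Step 2: turn frequent item sets into rules that meet min_confidence
rules_from <- function(frequent, transactions, min_confidence) {
  rules <- list()
  for (s in Filter(function(s) length(s) >= 2, frequent)) {
    for (rhs in s) {
      lhs <- setdiff(s, rhs)
      conf <- support(s, transactions) / support(lhs, transactions)
      if (conf >= min_confidence) {
        rules[[length(rules) + 1]] <- data.frame(
          lhs = paste(lhs, collapse = ", "), rhs = rhs, confidence = conf
        )
      }
    }
  }
  do.call(rbind, rules)
}

frequent <- frequent_itemsets(transactions, min_support = 0.5)
rules <- rules_from(frequent, transactions, min_confidence = 0.7)
rules
```

On this toy data only flowers, get well card, and soda survive the first iteration, so the second iteration evaluates three candidate pairs instead of all fifteen.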

An Example in R

In order to perform our market basket analysis, you will first need to install the arules package if you haven’t already done so. For our data, we will be using a fictional list of grocery transactions, which can be downloaded here. Let’s first load, inspect, and clean our data.

groceries <- read.csv('/Users/cyobero/Desktop/groceries.csv', header = FALSE)
head(groceries)
##                 V1                  V2             V3
## 1     citrus fruit semi-finished bread      margarine
## 2   tropical fruit              yogurt         coffee
## 3       whole milk                                   
## 4        pip fruit              yogurt  cream cheese 
## 5 other vegetables          whole milk condensed milk
## 6       whole milk              butter         yogurt
##                         V4
## 1              ready soups
## 2                         
## 3                         
## 4             meat spreads
## 5 long life bakery product
## 6                     rice

Our data contains rows that each represent a customer and columns corresponding to that customer’s item purchases. However, you’ll notice that R has decided to name our four columns V1, V2, V3, and V4. This is due to the fact that when we loaded the CSV file we chose not to include headers. This is problematic: since grocery purchases can contain more than four items, these transactions will be broken across multiple rows in the matrix.

Fortunately, we can remedy this issue by reading our original groceries file as a sparse matrix. The \(\text{arules}\) package has a \(\text{read.transactions()}\) function that allows us to do this easily. Let’s go ahead and do that.

library(arules)
groceries <- read.transactions('/Users/cyobero/Desktop/groceries.csv', sep=',')
summary(groceries)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55 
##   16   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   46   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

How convenient, right!? Let’s examine the output. We can see that the most commonly purchased items in this data set are whole milk, other vegetables, rolls/buns, soda, and yogurt.

The first block in our output provides a summary of our sparse matrix. In this example, the data contains 9,835 transactions (rows) and 169 different items (columns). Each cell in the sparse matrix contains a 1 if that item was purchased in a particular transaction and 0 otherwise. Density refers to the proportion of non-zero cells in the matrix.

The third block of our output (the element length distribution) contains summary statistics about the size of each transaction. For example, there were 2,159 transactions in which only 1 item was bought and one transaction in which 32 items were bought.
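To make the sparse-matrix idea concrete, here is a small base R sketch that builds the same kind of 0/1 matrix by hand for the five Table 1 transactions and then computes its density, the transaction sizes, and the per-item support. It uses an ordinary dense matrix for readability and is not how arules stores the data internally. The last line repeats the density bookkeeping with the figures from the groceries summary.

```r
transactions <- list(
  c("flowers", "get well card", "soda"),
  c("toy bear", "flowers", "balloons", "candy"),
  c("get well card", "candy", "flowers"),
  c("toy bear", "balloons", "soda"),
  c("flowers", "get well card", "soda")
)

items <- sort(unique(unlist(transactions)))

# One row per transaction, one column per item; 1 where the item was purchased
incidence <- t(sapply(transactions, function(t) as.integer(items %in% t)))
colnames(incidence) <- items
incidence

mean(incidence)       # density: share of non-zero cells (16 of 30)
rowSums(incidence)    # transaction sizes
colMeans(incidence)   # support of each individual item

# Groceries summary: rows * columns * density = total item purchases
total_items <- 9835 * 169 * 0.02609146
round(total_items)    # 43367
```

Dividing that total by 9,835 transactions gives roughly 4.409 items per transaction, which agrees with the mean reported in the summary.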

Let’s now examine some of the items that were purchased for the first 10 transactions.

inspect(groceries[1:10])
##    items                     
## 1  {citrus fruit,            
##     margarine,               
##     ready soups,             
##     semi-finished bread}     
## 2  {coffee,                  
##     tropical fruit,          
##     yogurt}                  
## 3  {whole milk}              
## 4  {cream cheese,            
##     meat spreads,            
##     pip fruit,               
##     yogurt}                  
## 5  {condensed milk,          
##     long life bakery product,
##     other vegetables,        
##     whole milk}              
## 6  {abrasive cleaner,        
##     butter,                  
##     rice,                    
##     whole milk,              
##     yogurt}                  
## 7  {rolls/buns}              
## 8  {bottled beer,            
##     liquor (appetizer),      
##     other vegetables,        
##     rolls/buns,              
##     UHT-milk}                
## 9  {pot plants}              
## 10 {cereals,                 
##     whole milk}

We can also examine the frequency of items purchased in our data. This is useful in allowing us to view the support for each item. Let’s examine the first 4 items (the columns are listed in alphabetical order).

itemFrequency(groceries[, 1:4])
## abrasive cleaner artif. sweetener   baby cosmetics        baby food 
##     0.0035587189     0.0032536858     0.0006100661     0.0001016777

We can see that the support for abrasive cleaner, artificial sweetener, baby cosmetics, and baby food is about 0.36%, 0.33%, 0.061%, and 0.01%, respectively. Let’s now visualize the item frequencies using items that have a support of at least 7.8%.

itemFrequencyPlot(groceries, support = 0.078)

Now let’s look at the top 25 items in terms of support.

itemFrequencyPlot(groceries, topN = 25)

Training Our Model

We’ve gained quite a bit of information about our grocery data. However, we’ve yet to actually use the apriori algorithm to find transactional patterns in our data set. Doing so is fairly straightforward. We simply call the \(\text{apriori()}\) function and provide a list of parameters, those parameters being the support level, confidence level, and minimum length of each item set. To start, we’ll use a support level of 0.3%, a confidence level of 25%, and a minimum length of 2 so that we can eliminate rules containing fewer than two items.

grocery.rules <- apriori(groceries, parameter = list(support = 0.003, confidence = 0.25, minlen = 2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport support minlen maxlen
##        0.25    0.1    1 none FALSE            TRUE   0.003      2     10
##  target   ext
##   rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 29 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [136 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.01s].
## writing ... [1771 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
grocery.rules
## set of 1771 rules
summary(grocery.rules)
## set of 1771 rules
## 
## rule length distribution (lhs + rhs):sizes
##    2    3    4    5 
##  228 1207  326   10 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   3.000   3.067   3.000   5.000 
## 
## summary of quality measures:
##     support           confidence          lift        
##  Min.   :0.003050   Min.   :0.2500   Min.   : 0.9932  
##  1st Qu.:0.003457   1st Qu.:0.3056   1st Qu.: 1.8089  
##  Median :0.004270   Median :0.3846   Median : 2.1879  
##  Mean   :0.005984   Mean   :0.4055   Mean   : 2.3102  
##  3rd Qu.:0.006202   3rd Qu.:0.4924   3rd Qu.: 2.6962  
##  Max.   :0.074835   Max.   :0.8857   Max.   :11.4214  
## 
## mining info:
##       data ntransactions support confidence
##  groceries          9835   0.003       0.25
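The 0.3% support threshold can be translated into a transaction count using the 9,835 transactions reported above:

\[ \begin{aligned} 0.003 \times 9835 &= 29.505 \\ \frac{30}{9835} &\approx 0.003050 \end{aligned} \]

An item set therefore needs to appear in at least 30 transactions to clear the threshold, which agrees with the minimum support of 0.003050 in the summary. The "Absolute minimum support count: 29" line in the log appears to be the same product with the fractional part dropped.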

Our summary output provides us with summary statistics on our model’s support, confidence, and lift. We already know what support and confidence are, but what is lift? Lift is a measure of how much more likely one item is to be purchased relative to its typical purchase rate, given that you know another item has been purchased. It can be expressed as

\[ \begin{aligned} \text{Lift}(X \to Y) = \frac{\text{Confidence}(X \to Y)}{\text{Support(Y)}} \end{aligned} \]

For example, many people purchase both milk and cereal, so purely by chance we would expect to find many transactions that contain both items. Thus if \(\text{Lift}(\text{milk} \to \text{cereal}) > 1\), this implies that the two items are purchased together more often than one would expect by chance. A large lift value is a strong indicator that a rule is important and may reflect a true connection between two items. It’s also important to note that unlike confidence, where \(\text{Confidence}(X \to Y) \neq \text{Confidence}(Y \to X)\) in general, \(\text{Lift}(X \to Y) = \text{Lift}(Y \to X)\).
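As a quick check of the formula, the Table 1 figures from earlier (supports of 0.8 for flowers and 0.6 for get well card, with confidences of 1.0 and 0.75) give the same lift in both directions:

\[ \begin{aligned} \text{Lift}(\text{get well card} \to \text{flowers}) &= \frac{\text{Confidence}(\text{get well card} \to \text{flowers})}{\text{Support(flowers)}} = \frac{1.0}{0.8} = 1.25 \\ \text{Lift}(\text{flowers} \to \text{get well card}) &= \frac{\text{Confidence}(\text{flowers} \to \text{get well card})}{\text{Support(get well card)}} = \frac{0.75}{0.6} = 1.25 \end{aligned} \]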

We can now look at the rules for our model using the \(\text{inspect()}\) function.

inspect(grocery.rules[1:10])
##    lhs                        rhs                support     confidence
## 1  {liquor}                => {bottled beer}     0.004677173 0.4220183 
## 2  {cereals}               => {whole milk}       0.003660397 0.6428571 
## 3  {candles}               => {whole milk}       0.003050330 0.3409091 
## 4  {soups}                 => {other vegetables} 0.003152008 0.4626866 
## 5  {Instant food products} => {hamburger meat}   0.003050330 0.3797468 
## 6  {Instant food products} => {whole milk}       0.003050330 0.3797468 
## 7  {specialty cheese}      => {other vegetables} 0.004270463 0.5000000 
## 8  {specialty cheese}      => {whole milk}       0.003762074 0.4404762 
## 9  {chocolate marshmallow} => {whole milk}       0.003152008 0.3483146 
## 10 {flower (seeds)}        => {other vegetables} 0.003762074 0.3627451 
##    lift     
## 1   5.240594
## 2   2.515917
## 3   1.334199
## 4   2.391236
## 5  11.421438
## 6   1.486196
## 7   2.584078
## 8   1.723869
## 9   1.363181
## 10  1.874723
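The lift column can be checked by hand against the lift formula. The base R sketch below recomputes the lift of rule 2, \(\{\text{cereals}\} \to \{\text{whole milk}\}\), using only numbers printed earlier in this tutorial; the variable names are made up for illustration.

```r
n_transactions <- 9835            # from summary(groceries)
count_whole_milk <- 2513          # whole milk count from summary(groceries)
confidence_cereals_milk <- 0.6428571   # confidence of rule 2 above

# Support of the right-hand side, then lift = confidence / support(rhs)
support_whole_milk <- count_whole_milk / n_transactions
lift_cereals_milk <- confidence_cereals_milk / support_whole_milk

round(lift_cereals_milk, 4)       # 2.5159
```

The result agrees with the 2.515917 reported by `inspect()` up to rounding of the inputs.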

If you’re wondering what lhs and rhs stand for, they simply mean left-hand side and right-hand side. Our results are pretty interesting, if somewhat intuitive and obvious. For example, 64.3% of the transactions that contain cereals also contain whole milk (I typically prefer my cereal with a bottle of Veuve Clicquot, but hey, that’s just how I roll).

While our results are nice, they really don’t provide us with new and valuable insights. We don’t really need to perform a market basket analysis to figure out that consumers who purchase cereal are likely to also purchase milk, or that consumers who purchase soup are more likely to purchase other vegetables. Perhaps the beauty of market basket analysis is finding latent patterns in your data. Let’s see if we can find some interesting rules by sorting our rules by lift.

inspect(sort(grocery.rules, by = 'lift')[1:20])
##    lhs                        rhs                      support confidence      lift
## 1  {Instant food products} => {hamburger meat}     0.003050330  0.3797468 11.421438
## 2  {flour}                 => {sugar}              0.004982206  0.2865497  8.463112
## 3  {processed cheese}      => {white bread}        0.004168785  0.2515337  5.975445
## 4  {citrus fruit,                                                                  
##     other vegetables,                                                              
##     tropical fruit,                                                                
##     whole milk}            => {root vegetables}    0.003152008  0.6326531  5.804238
## 5  {other vegetables,                                                              
##     root vegetables,                                                               
##     tropical fruit,                                                                
##     whole milk}            => {citrus fruit}       0.003152008  0.4492754  5.428284
## 6  {liquor}                => {bottled beer}       0.004677173  0.4220183  5.240594
## 7  {citrus fruit,                                                                  
##     other vegetables,                                                              
##     root vegetables,                                                               
##     whole milk}            => {tropical fruit}     0.003152008  0.5438596  5.183004
## 8  {berries,                                                                       
##     whole milk}            => {whipped/sour cream} 0.004270463  0.3620690  5.050990
## 9  {herbs,                                                                         
##     whole milk}            => {root vegetables}    0.004168785  0.5394737  4.949369
## 10 {tropical fruit,                                                                
##     whole milk,                                                                    
##     yogurt}                => {curd}               0.003965430  0.2617450  4.912713
## 11 {other vegetables,                                                              
##     whipped/sour cream,                                                            
##     whole milk}            => {butter}             0.003965430  0.2708333  4.887424
## 12 {butter,                                                                        
##     other vegetables,                                                              
##     whole milk}            => {whipped/sour cream} 0.003965430  0.3451327  4.814724
## 13 {herbs,                                                                         
##     other vegetables}      => {root vegetables}    0.003863752  0.5000000  4.587220
## 14 {citrus fruit,                                                                  
##     root vegetables,                                                               
##     tropical fruit,                                                                
##     whole milk}            => {other vegetables}   0.003152008  0.8857143  4.577509
## 15 {onions,                                                                        
##     whole milk}            => {butter}             0.003050330  0.2521008  4.549379
## 16 {butter,                                                                        
##     other vegetables,                                                              
##     yogurt}                => {tropical fruit}     0.003050330  0.4761905  4.538114
## 17 {citrus fruit,                                                                  
##     other vegetables,                                                              
##     tropical fruit}        => {root vegetables}    0.004473818  0.4943820  4.535678
## 18 {beef,                                                                          
##     tropical fruit}        => {root vegetables}    0.003762074  0.4933333  4.526057
## 19 {onions,                                                                        
##     other vegetables,                                                              
##     whole milk}            => {root vegetables}    0.003253686  0.4923077  4.516648
## 20 {beef,                                                                          
##     soda}                  => {root vegetables}    0.003965430  0.4875000  4.472540

Now we have some interesting patterns. Apparently people who purchase tropical fruit, whole milk, and yogurt are nearly 5 times more likely to purchase curd (WTF is curd?) than the typical consumer. This kind of information could prove useful to a retailer who might want to stock curd next to tropical fruit, whole milk, and yogurt. Sidenote - I just Googled what curd is and realized that it’s a dairy product, so I guess this “new” information isn’t really that surprising.

Aside from sorting our list by lift, we can also look at the support, confidence, and lift of a specific item. Let’s take a look at the transactional patterns of ham purchases.

ham.rules <- subset(grocery.rules, items %in% 'ham')
inspect(ham.rules)
##     lhs                       rhs                support     confidence
## 120 {ham}                  => {yogurt}           0.006710727 0.2578125 
## 121 {ham}                  => {rolls/buns}       0.006914082 0.2656250 
## 122 {ham}                  => {other vegetables} 0.009150991 0.3515625 
## 123 {ham}                  => {whole milk}       0.011489578 0.4414062 
## 289 {ham,yogurt}           => {other vegetables} 0.003050330 0.4545455 
## 290 {ham,other vegetables} => {yogurt}           0.003050330 0.3333333 
## 291 {ham,yogurt}           => {whole milk}       0.003965430 0.5909091 
## 292 {ham,whole milk}       => {yogurt}           0.003965430 0.3451327 
## 293 {ham,rolls/buns}       => {whole milk}       0.003457041 0.5000000 
## 294 {ham,whole milk}       => {rolls/buns}       0.003457041 0.3008850 
## 295 {ham,other vegetables} => {whole milk}       0.004778851 0.5222222 
## 296 {ham,whole milk}       => {other vegetables} 0.004778851 0.4159292 
##     lift    
## 120 1.848095
## 121 1.444125
## 122 1.816930
## 123 1.727509
## 289 2.349162
## 290 2.389456
## 291 2.312611
## 292 2.474038
## 293 1.956825
## 294 1.635823
## 295 2.043794
## 296 2.149587

Now, this is pretty interesting. Consumers who purchase ham are nearly twice as likely to purchase yogurt as the typical consumer (a lift of about 1.85). Umm, ok. You can do this with any item that appears in your rules.
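The same `subset()` call can combine several conditions, for example to keep only rules with ham on the left-hand side and a lift above 2. The sketch below is self-contained and therefore makes one assumption: it mines the `Groceries` data set that ships with arules as a stand-in for the CSV file used in this tutorial, so the exact rules returned may differ from the output above. The object names `rules` and `ham_lhs` are illustrative.

```r
library(arules)

# Assumption: the bundled Groceries data stands in for groceries.csv
data("Groceries")

rules <- apriori(
  Groceries,
  parameter = list(support = 0.003, confidence = 0.25, minlen = 2),
  control = list(verbose = FALSE)   # suppress the progress log
)

# Rules with ham on the left-hand side AND a lift greater than 2
ham_lhs <- subset(rules, lhs %in% "ham" & lift > 2)

# Show the strongest rules first
inspect(sort(ham_lhs, by = "confidence"))
```

Replacing `lhs` with `rhs` would instead show what tends to lead to a ham purchase, and any of the quality measures (support, confidence, lift) can be used in the condition.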

Conclusion

Market basket analysis is an unsupervised machine learning technique that can be useful for finding patterns in transactional data. It can be a very powerful tool for analyzing the purchasing patterns of consumers. The main algorithm used in market basket analysis is the apriori algorithm. The three statistical measures in market basket analysis are support, confidence, and lift. Support measures how frequently an item set appears in a given transactional data set, confidence measures a rule’s predictive power or accuracy, and lift measures how much more likely an item is to be purchased relative to its typical purchase rate, given that another item has been purchased. In our example, we examined the transactional patterns of grocery purchases and discovered both obvious and not-so-obvious patterns in certain transactions.