Market Basket Analysis using Association Rules


Preliminary Discussion

Retailers

Retailers use sophisticated data analysis techniques to identify patterns in purchasing behavior. Barcode scanners, inventory databases, and online shopping carts generate vast amounts of transactional data from which machine learning can learn purchasing patterns. The practice is commonly known as market basket analysis because it has so frequently been applied to supermarket data.

Market Basket Analysis

Market Basket Analysis is one of the key techniques used by large retailers to identify relationships between the items that people buy. It uncovers associations between items by looking for combinations of items that frequently occur together in transactions.

Although the technique originated with shopping data, it is also useful in other contexts. For the scope of this project, however, we shall remain in the consumer and shopping context.

Association Rules

The result of a market basket analysis is a collection of association rules that specify patterns found in the relationships among items in the item sets. Association rules are always composed from subsets of item sets and are denoted by relating one item set on the left-hand side (LHS) of the rule to another item set on the right-hand side (RHS) of the rule. The LHS is the condition that needs to be met in order to trigger the rule, and the RHS is the expected result of meeting that condition.

{peanut butter, jelly} \(\rightarrow\) {bread}

In plain language, this association rule states that if peanut butter and jelly are purchased together, then bread is also likely to be purchased. In other words, “peanut butter and jelly imply bread.”

Association rules are not used for prediction, but rather for unsupervised knowledge discovery in large databases. Because association rule learners are unsupervised, there is no need for the algorithm to be trained; data does not need to be labeled ahead of time. The program is simply unleashed on a data set in the hope that interesting associations are found. The downside, of course, is that there isn’t an easy way to objectively measure the performance of a rule learner.

Measuring Rule Interest

Whether or not an association rule is deemed interesting is determined by statistical measures such as support, confidence, and lift. Assume there are 100 customers: 10 of them bought milk, 8 bought butter, and 6 bought both. Consider the rule bought milk => bought butter. We have that:

  • \(\text{support} = P(\text{Milk \& Butter}) = 6/100 = 0.06\)

  • \(\text{confidence} = \frac{\text{support}}{P(\text{Milk})} = 0.06/0.10 = 0.60\)

  • \(\text{lift} = \frac{\text{confidence}}{P(\text{Butter})} = 0.60/0.08 = 7.5\)

The support of an item set or rule measures how frequently it occurs in the data. A rule’s confidence is a measure of its predictive power or accuracy. Lift measures how much more often the LHS and RHS occur together than we would expect if they were statistically independent; a lift well above 1 indicates a meaningful association. Rules with high support and high confidence are known as strong rules.
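As a quick sanity check, these measures can be computed directly from the toy counts above in R (the object names here are purely illustrative):

# Toy counts from the example above
n        <- 100  # total customers
n_milk   <- 10   # customers who bought milk
n_butter <- 8    # customers who bought butter
n_both   <- 6    # customers who bought both

support    <- n_both / n                   # P(Milk & Butter) = 0.06
confidence <- support / (n_milk / n)       # P(Butter | Milk) = 0.60
lift       <- confidence / (n_butter / n)  # 0.60 / 0.08 = 7.5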

Setup

We start by loading the packages.
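A minimal sketch of the packages used in this analysis; the exact set is inferred from the functions called below (for example ddply() from plyr, hms() from lubridate, and apriori() from arules), so treat it as an assumption rather than a definitive list.

library(tidyverse)   # data manipulation and plotting
library(readxl)      # reading the Online Retail .xlsx file
library(lubridate)   # date/time helpers such as hms()
library(plyr)        # ddply() used to build the transactions
library(arules)      # association rule mining (apriori, read.transactions)
library(arulesViz)   # visualising association rules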

Let’s read in our data.
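A sketch of the import step, assuming the UCI file has been downloaded into the working directory under its default name:

# Read the UCI "Online Retail" spreadsheet into a tibble called retail
retail <- read_excel("Online Retail.xlsx")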

Data Description and Preprocessing

The data set we are using comes from the UCI Machine Learning Repository, where it is listed as “Online Retail”. It contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered online retailer. The company mainly sells unique all-occasion gifts, and many of its customers are wholesalers.

An extract of the data is shown below.
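Such an extract can be printed with, for example:

head(retail)   # first few rows of the data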

The variables (columns) in our data frame are described as follows:

  • InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter ‘c’, it indicates a cancellation.

  • StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.

  • Description: Product (item) name. Nominal.

  • Quantity: The quantities of each product (item) per transaction. Numeric.

  • InvoiceDate: Invoice Date and time. Numeric, the day and time when each transaction was generated.

  • UnitPrice: Unit price. Numeric, product price per unit in sterling.

  • CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.

  • Country: Country name. Nominal, the name of the country where each customer resides.

The structure of the variables is given below.
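This structure can be displayed with str():

str(retail)   # compact summary of each column's type and first values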

## tibble [541,909 × 8] (S3: tbl_df/tbl/data.frame)
##  $ InvoiceNo  : chr [1:541909] "536365" "536365" "536365" "536365" ...
##  $ StockCode  : chr [1:541909] "85123A" "71053" "84406B" "84029G" ...
##  $ Description: chr [1:541909] "WHITE HANGING HEART T-LIGHT HOLDER" "WHITE METAL LANTERN" "CREAM CUPID HEARTS COAT HANGER" "KNITTED UNION FLAG HOT WATER BOTTLE" ...
##  $ Quantity   : num [1:541909] 6 6 8 6 6 2 6 6 6 32 ...
##  $ InvoiceDate: POSIXct[1:541909], format: "2010-12-01 08:26:00" "2010-12-01 08:26:00" ...
##  $ UnitPrice  : num [1:541909] 2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
##  $ CustomerID : num [1:541909] 17850 17850 17850 17850 17850 ...
##  $ Country    : chr [1:541909] "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...

We see that we have over 540,000 records (rows) across the 8 variables. Let’s see how many missing values we have for each variable.
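One way to obtain these counts (assuming the data frame is called retail):

colSums(is.na(retail))   # number of missing values per column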

##   InvoiceNo   StockCode Description    Quantity InvoiceDate   UnitPrice 
##           0           0        1454           0           0           0 
##  CustomerID     Country 
##      135080           0

We decide to remove the observations with missing data. We also convert Description and Country into factor variables, convert InvoiceDate into a date variable with an appropriate format, and change InvoiceNo to numeric.

After pre-processing, the data set includes 406,829 records and 10 fields: InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, Country, Date, Time.
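A sketch of the preprocessing described above; the exact code is an assumption, but it covers the same steps (dropping rows with missing values, converting types, and splitting the timestamp into Date and Time):

# Keep only complete rows
retail <- retail[complete.cases(retail), ]

# Convert types
retail$Description <- as.factor(retail$Description)
retail$Country     <- as.factor(retail$Country)
retail$InvoiceNo   <- as.numeric(retail$InvoiceNo)  # cancellations ("C...") become NA

# Split the invoice timestamp into separate Date and Time fields
retail$Date <- as.Date(retail$InvoiceDate)
retail$Time <- format(retail$InvoiceDate, "%H:%M:%S")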

Data Exploration

Let’s start by exploring the time-of-day effect in the retail store, that is, at what times people tend to purchase online. We need to extract the hour from the Time column. We use the hms() function, which transforms a character or numeric vector into a period object with the specified number of hours, minutes, and seconds.
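A sketch of the hour extraction, assuming the Time column created during preprocessing:

# Parse "HH:MM:SS" with lubridate and keep only the hour, e.g. "08:26:00" -> 8
retail$Hour <- hour(hms(retail$Time))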

There is a clear effect of hour of day on order volume: most orders are placed between 11:00 and 15:00. Let’s also investigate how many items each customer buys per transaction. Most invoices contain fewer than 10 items.

As mentioned, very few people appear to buy more than, say, 30 items in a single transaction. Given the aim of this project, a key interest is the top 10 most frequently purchased items in our data set.

Let’s visualize this.
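A sketch of one way to plot the top 10 items; the object names are illustrative.

# Count how often each item description appears and keep the 10 most frequent
item_counts <- sort(table(retail$Description), decreasing = TRUE)[1:10]

# Horizontal bar chart of the top 10 items
par(mar = c(4, 12, 2, 1))   # widen the left margin for the long item names
barplot(rev(item_counts), horiz = TRUE, las = 1,
        xlab = "Number of purchases", main = "Top 10 items")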

Association Rules Setup

Before applying any rule mining algorithm, we transform the data from the data frame format into transactions, so that all the items bought together appear in one row. This is shown below.

The function ddply() accepts a data frame, splits it into pieces based on one or more factors, computes on the pieces, and then returns the results as a data frame. We use “,” to separate the different items. So what we have now is, for each customer and purchase date, a variable that contains the list of all items bought in that transaction. We only need the item transactions, so we remove the CustomerID and Date columns.
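A sketch of this transformation, grouping by customer and purchase date and pasting the item descriptions into one comma-separated string per transaction (object names are illustrative):

# One row per (customer, date): all items bought together, comma-separated
itemList <- ddply(retail, c("CustomerID", "Date"),
                  function(df) paste(df$Description, collapse = ","))

# Keep only the item strings; drop the CustomerID and Date columns
itemList$CustomerID <- NULL
itemList$Date       <- NULL
colnames(itemList)  <- "items"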

We write the data frame to a CSV file and check whether our transaction format is correct. We can do this manually.
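A sketch of writing the baskets out and reading them back as an arules transactions object (the file name is illustrative):

# Write one comma-separated basket per line (no header, row names, or quotes)
write.table(itemList$items, "market_basket.csv",
            quote = FALSE, row.names = FALSE, col.names = FALSE)

# Read the baskets back as a transactions object
tr <- read.transactions("market_basket.csv",
                        format = "basket", sep = ",", rm.duplicates = TRUE)
summary(tr)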

Our transaction data set now shows the matrix of items bought together. It does not yet tell us how often items are bought together, nor does it give us any rules; that is our next step.

We see 19,297 transactions (this is the number of rows) and 27,165 items; the items are the product descriptions in our original data set. A transaction here is a subset of these 27,165 items.

  • The most frequent items should be the same as our results in Figure 3.

  • Regarding the sizes of the transactions: 2,247 transactions contain just 1 item, 1,147 transactions contain 2 items, and so on, up to the largest transaction, which contains 420 items. This indicates that most customers buy a small number of items per purchase.

We have already seen that the distribution is right-skewed. Let’s have a look at the item frequency plot; this should align with the plot we obtained in our data exploration (most commonly purchased items).
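A sketch of the item frequency plot from arules, here showing the 20 most frequent items by absolute count:

itemFrequencyPlot(tr, topN = 20, type = "absolute",
                  main = "Absolute item frequency")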

Creating Rules

We shall use the Apriori algorithm from the arules library to mine frequent item sets and association rules. The algorithm employs a level-wise search for frequent item sets.

We need to set threshold values. We pass \(\text{supp}=0.001\) and \(\text{conf}=0.8\) to return all rules that have a support of at least \(0.1\%\) and a confidence of at least \(80\%\). We then sort the rules by decreasing confidence. Let’s have a look at the summary of the rules.
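The mining call below matches the one echoed in the mining info of the output; the sort step reorders the rules by confidence:

# Mine association rules with minimum support 0.1% and minimum confidence 80%
rules <- apriori(tr, parameter = list(supp = 0.001, conf = 0.8))

# Sort by decreasing confidence and summarise
rules <- sort(rules, by = "confidence", decreasing = TRUE)
summary(rules)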

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 19 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[27165 item(s), 19297 transaction(s)] done [0.13s].
## sorting and recoding items ... [2407 item(s)] done [0.01s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [0.46s].
## writing ... [87110 rule(s)] done [0.05s].
## creating S4 object  ... done [0.03s].
## set of 87110 rules
## 
## rule length distribution (lhs + rhs):sizes
##     2     3     4     5     6     7     8     9    10 
##   105  3133  9732 26228 29873 14020  3218   680   121 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   5.000   6.000   5.627   6.000  10.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift        
##  Min.   :0.001036   Min.   :0.8000   Min.   :0.001036   Min.   :  8.781  
##  1st Qu.:0.001088   1st Qu.:0.8333   1st Qu.:0.001244   1st Qu.: 19.305  
##  Median :0.001192   Median :0.8750   Median :0.001399   Median : 24.786  
##  Mean   :0.001383   Mean   :0.8834   Mean   :0.001572   Mean   : 50.921  
##  3rd Qu.:0.001503   3rd Qu.:0.9231   3rd Qu.:0.001658   3rd Qu.: 43.662  
##  Max.   :0.018086   Max.   :1.0000   Max.   :0.021765   Max.   :622.484  
##      count       
##  Min.   : 20.00  
##  1st Qu.: 21.00  
##  Median : 23.00  
##  Mean   : 26.69  
##  3rd Qu.: 29.00  
##  Max.   :349.00  
## 
## mining info:
##  data ntransactions support confidence
##    tr         19297   0.001        0.8
##                                                            call
##  apriori(data = tr, parameter = list(supp = 0.001, conf = 0.8))

  • The number of rules: 87,110.

  • The distribution of rules by length: most rules are 6 items long.

  • The summary of quality measures: ranges of support, confidence, and lift.

We have 87,110 rules. We won’t print them all, but instead look at the top 10.
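The top rules can be printed with inspect():

inspect(rules[1:10])   # the 10 rules with the highest confidence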

##      lhs                         rhs             support     confidence
## [1]  {WOBBLY CHICKEN}         => {DECORATION}    0.001451003 1         
## [2]  {WOBBLY CHICKEN}         => {METAL}         0.001451003 1         
## [3]  {DECOUPAGE}              => {GREETING CARD} 0.001191895 1         
## [4]  {BILLBOARD FONTS DESIGN} => {WRAP}          0.001502824 1         
## [5]  {WOBBLY RABBIT}          => {DECORATION}    0.001761932 1         
## [6]  {WOBBLY RABBIT}          => {METAL}         0.001761932 1         
## [7]  {BLACK TEA}              => {SUGAR JARS}    0.002331969 1         
## [8]  {BLACK TEA}              => {COFFEE}        0.002331969 1         
## [9]  {ART LIGHTS}             => {FUNK MONKEY}   0.001969218 1         
## [10] {FUNK MONKEY}            => {ART LIGHTS}    0.001969218 1         
##      coverage    lift      count
## [1]  0.001451003 385.94000 28   
## [2]  0.001451003 385.94000 28   
## [3]  0.001191895 344.58929 23   
## [4]  0.001502824 622.48387 29   
## [5]  0.001761932 385.94000 34   
## [6]  0.001761932 385.94000 34   
## [7]  0.002331969 212.05495 45   
## [8]  0.002331969  61.06646 45   
## [9]  0.001969218 507.81579 38   
## [10] 0.001969218 507.81579 38

We get a nice output of the LHS and RHS of rules for item sets, with corresponding measures. We can deduce the following:

  • 100% of customers who bought “WOBBLY CHICKEN” also bought “DECORATION”.

  • 100% of customers who bought “BLACK TEA” also bought “SUGAR JARS”.

Let’s now plot these top 10 rules.
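A sketch of two common arulesViz views of these rules: a support/confidence scatter plot shaded by lift, and a graph view.

top10 <- rules[1:10]

# Scatter plot of support vs. confidence, shaded by lift
plot(top10, method = "scatterplot",
     measure = c("support", "confidence"), shading = "lift")

# Graph view: items and rules as a network
plot(top10, method = "graph")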

Ideally we want strong rules, i.e., rules with both high support and confidence.

Using these plots we can examine the top 10 rules from our association rule mining. Moreover, retailers can use this information to make consumer- and data-driven decisions.

References

  • Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management.

  • R and Data Mining

  • AnalyticsVidya