Market Basket Analysis Using Association Rules

Step 1: Collecting Data

The market basket analysis using association rules is based on a dataset collected from one month of operation of a real-world grocery store. The dataset can be obtained from Prof. Eric Suess's website at http://www.sci.csueastbay.edu/~esuess/stat6620/ or directly from http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml13/groceries.csv.

Step 2: Exploring & Preparing the Data

Datasets are usually read into RStudio with read.table() or read.csv(). Here, the read.transactions() function from the arules package is used instead, which reads the groceries dataset into a sparse matrix ready for association rule analysis. A sparse matrix is needed because each transaction contains a different number and combination of items. Each row (observation) of the sparse matrix is one transaction, holding the items purchased together, and each column is one of the unique items that appear in the groceries dataset. The sparse matrix contains 9835 transactions (rows) and 169 unique items (columns) bought by customers over the one-month period.
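To make the row-and-column layout concrete, here is a minimal base R sketch. It is not how arules works internally: it builds a small dense 0/1 matrix from three made-up baskets (the `baskets` list and its contents are hypothetical), whereas arules keeps the same information in a compressed sparse format from the Matrix package.

```r
# Hypothetical toy baskets; each element is one transaction.
baskets <- list(
  c("whole milk", "yogurt"),
  c("whole milk"),
  c("coffee", "tropical fruit", "yogurt")
)

# Columns are every unique item seen in any basket.
items <- sort(unique(unlist(baskets)))

# One row per transaction, TRUE where the item is in the basket.
incidence <- t(sapply(baskets, function(b) items %in% b))
colnames(incidence) <- items

# Convert TRUE/FALSE to 1/0 for display.
incidence_01 <- incidence * 1L
print(incidence_01)

# Density is the share of cells that are 1.
density <- mean(incidence)
print(density)
```

Most cells in a real transaction matrix are 0, which is why a sparse representation saves memory compared with a dense matrix like this one.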

Using summary(), the details of the sparse matrix can be seen below. For example, 2513 of the 9835 transactions contain whole milk, and 1809 contain rolls/buns. There are 2159 transactions with only 1 item, and only 1 transaction with 32 unique items.

library(arules)

## Warning: package 'arules' was built under R version 3.3.3

## Loading required package: Matrix

## 
## Attaching package: 'arules'

## The following objects are masked from 'package:base':
## 
##     abbreviate, write

groceries <- read.transactions("http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml13/groceries.csv", sep = ",")
summary(groceries)

## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55 
##   16   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   46   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

Using inspect() on the sparse matrix, the first 5 transactions can be listed as shown below.

inspect(groceries[1:5])

##     items                     
## [1] {citrus fruit,            
##      margarine,               
##      ready soups,             
##      semi-finished bread}     
## [2] {coffee,                  
##      tropical fruit,          
##      yogurt}                  
## [3] {whole milk}              
## [4] {cream cheese,            
##      meat spreads,            
##      pip fruit,               
##      yogurt}                  
## [5] {condensed milk,          
##      long life bakery product,
##      other vegetables,        
##      whole milk}

Using itemFrequency() on the sparse matrix, the first three items (in alphabetical order) are shown below with their relative frequency of occurrence, that is, the proportion of transactions containing each item.

itemFrequency(groceries[, 1:3])

## abrasive cleaner artif. sweetener   baby cosmetics 
##     0.0035587189     0.0032536858     0.0006100661

Using itemFrequencyPlot() on the sparse matrix, the relative frequency of each item can be displayed as a bar chart. The items shown can be restricted by a support value, where the support of an item {x} is the number of transactions that contain {x} divided by the total number of transactions. Here support = 0.1 means that only items appearing in at least 10% of transactions are shown in the bar chart.

itemFrequencyPlot(groceries, support = 0.1)

Another parameter of itemFrequencyPlot() is topN. Setting topN = 20 shows the 20 most frequent items in the groceries data, arranged in descending order of frequency by default.

itemFrequencyPlot(groceries, topN = 20)

Using image() on the sparse matrix, the first 5 transactions in the groceries data can be visualized. Rows represent transactions and columns represent the unique items, so each black dot in the plot marks an item that was bought in that transaction.

image(groceries[1:5])

A random sample of 100 transactions from the groceries dataset is then plotted with image(). Such a plot can be used to look for specific patterns, such as popular items bought around a holiday.

image(sample(groceries, 100))

Step 3: Model Training on the Data

The apriori() function is applied to the groceries sparse matrix with specified minimum support and confidence values and a minimum number of items per rule (minlen = 2). This creates a rules object, groceryrules, which contains 463 rules, each with at least 2 items.

groceryrules <- apriori(groceries, parameter = list(support =
                          0.006, confidence = 0.25, minlen = 2))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.25    0.1    1 none FALSE            TRUE       5   0.006      2
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 59 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [109 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [463 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

groceryrules

## set of 463 rules

Step 4: Model Performance Evaluation

To see a summary of the rules object, summary() is used. It reports the total of 463 rules generated under the thresholds passed to apriori() above, broken down by rule length: 150 rules with 2 items, 297 rules with 3 items and 16 rules with 4 items.

summary(groceryrules)

## set of 463 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2   3   4 
## 150 297  16 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   3.000   2.711   3.000   4.000 
## 
## summary of quality measures:
##     support           confidence          lift       
##  Min.   :0.006101   Min.   :0.2500   Min.   :0.9932  
##  1st Qu.:0.007117   1st Qu.:0.2971   1st Qu.:1.6229  
##  Median :0.008744   Median :0.3554   Median :1.9332  
##  Mean   :0.011539   Mean   :0.3786   Mean   :2.0351  
##  3rd Qu.:0.012303   3rd Qu.:0.4495   3rd Qu.:2.3565  
##  Max.   :0.074835   Max.   :0.6600   Max.   :3.9565  
## 
## mining info:
##       data ntransactions support confidence
##  groceries          9835   0.006       0.25

inspect() is used to examine the first three rules below. Each rule describes a {x} => {y} relationship together with its support, confidence and lift values. Support, as mentioned previously, measures how frequently the itemset occurs in the data; confidence measures the rule's predictive power or accuracy; and lift measures how much more likely one item or itemset is to be purchased relative to its typical rate of purchase, given that another item or itemset has been purchased.

For example, the first rule states that a customer who buys potted plants is likely to also buy whole milk. The support shows that potted plants and whole milk are purchased together in about 0.7% of all transactions in the groceries data, and the confidence indicates that 40% of the transactions containing potted plants also contain whole milk. The lift shows that a customer who purchases potted plants is 1.56 times more likely to purchase whole milk than a typical customer. A larger lift value is a strong indicator that a rule is important and reflects a true connection between the items in the rule.

inspect(groceryrules[1:3])

##     lhs                rhs               support     confidence lift    
## [1] {potted plants} => {whole milk}      0.006914082 0.4000000  1.565460
## [2] {pasta}         => {whole milk}      0.006100661 0.4054054  1.586614
## [3] {herbs}         => {root vegetables} 0.007015760 0.4312500  3.956477
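The three measures for the first rule, {potted plants} => {whole milk}, can be tied back to raw counts. This is a sketch using only numbers printed earlier in this report; the two transaction counts are back-calculated from the reported support and confidence rather than read from the data, and the variable names are illustrative.

```r
n_transactions <- 9835
whole_milk_count <- 2513        # from summary(groceries)
rule_support <- 0.006914082     # from inspect(groceryrules[1:3])
rule_confidence <- 0.4          # from inspect(groceryrules[1:3])

# Transactions containing both potted plants and whole milk
# (back-calculated from the reported support).
both_count <- round(rule_support * n_transactions)

# Transactions containing potted plants, the left-hand side
# (back-calculated from the reported confidence).
lhs_count <- round(both_count / rule_confidence)

# Baseline share of all transactions that contain whole milk.
milk_support <- whole_milk_count / n_transactions

# Lift compares the rule's confidence with that baseline.
lift <- rule_confidence / milk_support

print(both_count)
print(lhs_count)
print(lift)
```

The lift computed this way should agree with the reported value of about 1.565: whole milk appears in roughly 25.6% of all baskets but in 40% of the baskets that contain potted plants.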

Step 5: Model Performance Improvement

A useful way to examine the set of association rules is to look at the rules with the highest lift, because a larger lift value indicates a strong connection between items that may not have been obvious before. For example, the rule with the highest lift in the dataset associates herbs with root vegetables: a customer who buys herbs is almost 4 times more likely to purchase root vegetables than a typical customer.

inspect(sort(groceryrules, by = "lift")[1:5])

##     lhs                   rhs                      support confidence     lift
## [1] {herbs}            => {root vegetables}    0.007015760  0.4312500 3.956477
## [2] {berries}          => {whipped/sour cream} 0.009049314  0.2721713 3.796886
## [3] {other vegetables,                                                        
##      tropical fruit,                                                          
##      whole milk}       => {root vegetables}    0.007015760  0.4107143 3.768074
## [4] {beef,                                                                    
##      other vegetables} => {root vegetables}    0.007930859  0.4020619 3.688692
## [5] {other vegetables,                                                        
##      tropical fruit}   => {pip fruit}          0.009456024  0.2634561 3.482649

Another useful way to examine the rules is to look at a specific subset, for example the rules that contain {berries}. This can be useful if one were asked to create an advertisement to promote berries in a particular season. The subset() function is used on the rules object to select the rules that contain {berries}. Looking at the berry rules with the highest lift, one can see that customers who buy berries are 3.8 times more likely to buy whipped/sour cream and 2.3 times more likely to buy yogurt than a typical customer. This suggests that berries with cream or yogurt may be a popular combination for dessert.

berryrules <- subset(groceryrules, items %in% "berries")
inspect(berryrules)

##     lhs          rhs                  support     confidence lift    
## [1] {berries} => {whipped/sour cream} 0.009049314 0.2721713  3.796886
## [2] {berries} => {yogurt}             0.010574479 0.3180428  2.279848
## [3] {berries} => {other vegetables}   0.010269446 0.3088685  1.596280
## [4] {berries} => {whole milk}         0.011794611 0.3547401  1.388328

Finally, the rules object can be saved as a CSV file or converted to a data frame for future analysis.

write(groceryrules, file = "groceryrules.csv",
      sep = ",", quote = TRUE, row.names = FALSE)

groceryrules_df <- as(groceryrules, "data.frame")
str(groceryrules_df)

## 'data.frame':    463 obs. of  4 variables:
##  $ rules     : Factor w/ 463 levels "{baking powder} => {other vegetables}",..: 340 302 207 206 208 341 402 21 139 140 ...
##  $ support   : num  0.00691 0.0061 0.00702 0.00773 0.00773 ...
##  $ confidence: num  0.4 0.405 0.431 0.475 0.475 ...
##  $ lift      : num  1.57 1.59 3.96 2.45 1.86 ...
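Once the rules are in a data frame, ordinary base R subsetting and sorting apply. The sketch below does not use groceryrules_df itself. Instead it rebuilds a small stand-in data frame, `rules_df` (a hypothetical name), from the four berry rules printed earlier so that it runs without arules, and the cut-off of lift > 2 is an arbitrary illustrative choice.

```r
# Stand-in for groceryrules_df, holding the four berry rules
# copied from the inspect(berryrules) output.
rules_df <- data.frame(
  rules = c("{berries} => {whipped/sour cream}",
            "{berries} => {yogurt}",
            "{berries} => {other vegetables}",
            "{berries} => {whole milk}"),
  support = c(0.009049314, 0.010574479, 0.010269446, 0.011794611),
  confidence = c(0.2721713, 0.3180428, 0.3088685, 0.3547401),
  lift = c(3.796886, 2.279848, 1.596280, 1.388328),
  stringsAsFactors = FALSE
)

# Keep rules above the illustrative lift cut-off, strongest first.
strong <- rules_df[rules_df$lift > 2, ]
strong <- strong[order(strong$lift, decreasing = TRUE), ]
print(strong)
```

The same two lines of filtering and ordering should work on groceryrules_df, since it has the same support, confidence and lift columns.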

Conclusion: Association rule mining is an unsupervised learning method: no training step is required, and therefore no holdout method is needed. Its purpose is to find patterns and associations between different products (items) in order to support actionable business decisions. The groceries dataset is used for the association rule analysis and is first converted into a sparse matrix in which the rows represent transactions and the columns represent the unique items available in the grocery store. Each cell of the sparse matrix is 1 if the item was bought in that transaction and 0 otherwise. Various plots help to identify frequently purchased products and their association patterns. Finally, rules are generated to show how buying one or more items increases the chance of buying another item, and the strength of each rule is described by its support, confidence and lift statistics.