Association rules - market basket analysis

Introduction

Understanding consumer behavior is key to success for many companies. Purchased goods are often related: buying one good makes the purchase of another more likely, and knowing such patterns helps in designing an effective sales strategy. This report aims to reveal consumers' purchasing patterns. For this purpose we analyze market basket data with association rules, mined with the Apriori algorithm. The rules are evaluated with three measures:
- Support - how often an item or a rule appears in the data set, as a share of all transactions
- Confidence - the share of transactions containing item X that also contain item Y
- Lift - the strength of the association between two items: the confidence of the rule divided by the support of Y. A value greater than 1 suggests a positive association between the items, a value lower than 1 a negative association, and a value close to 1 implies no dependency.
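To make the three measures concrete, here is a minimal sketch in base R that computes them by hand for the rule {pizza} => {coke}. The five baskets and the helper `support()` are invented for illustration and are not part of the analysed data set.

```r
# Toy baskets, invented purely for illustration (not taken from the report's data)
baskets <- list(
  c("pizza", "coke"),
  c("pizza", "lasagna"),
  c("pizza", "coke", "mars"),
  c("mars"),
  c("coke", "mars")
)

# Hypothetical helper: share of baskets that contain every item in `items`
support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

supp_pizza <- support("pizza")             # share of baskets with pizza
supp_coke <- support("coke")               # share of baskets with coke
supp_both <- support(c("pizza", "coke"))   # share of baskets with both

# Rule {pizza} => {coke}
confidence <- supp_both / supp_pizza       # share of pizza baskets that also hold coke
lift <- confidence / supp_coke             # confidence relative to coke's baseline share

print(c(support = supp_both, confidence = confidence, lift = lift))
```

In this toy example the lift is slightly above 1, so coke appears a little more often in pizza baskets than in baskets overall.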

Data

The dataset provides information about 2000 store transactions. Each row represents one market basket, and the 42 columns stand for 42 different products. The data comes from Kaggle (https://www.kaggle.com/arronlacey/market-basket-analysis?select=market_basket_analysis.csv). Before applying the Apriori algorithm, we transform the data from matrix format to basket format and explore the transactions. We then use the itemFrequencyPlot() function to draw bar plots of the product distribution.
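The report does not show the conversion step itself, so the following is only a sketch of the idea in base R, assuming the raw data is a 0/1 matrix with one column per product. The matrix `purchases` and its values are hypothetical. With arules loaded, a logical matrix can usually be coerced directly with as(purchases == 1, "transactions").

```r
# Hypothetical 0/1 purchase matrix: rows are baskets, columns are products
# (values invented for illustration)
purchases <- matrix(
  c(1, 0, 1,
    0, 1, 1,
    1, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("pizza", "coke", "mars"))
)

# Basket format: one character vector of product names per row
baskets <- lapply(seq_len(nrow(purchases)), function(i) {
  colnames(purchases)[purchases[i, ] == 1]
})

print(baskets)
```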

summary(market)

## transactions as itemMatrix in sparse format with
##  2000 rows (elements/itemsets/transactions) and
##  43 columns (items) and a density of 0.1183023 
## 
## most frequent items:
##   pizza    mars    coke lasagna    twix (Other) 
##     491     486     482     433     425    7857 
## 
## element (itemset/transaction) length distribution:
## sizes
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  16  17 
## 124 324 275 247 224 219 173 128 107  78  57  31   9   2   1   1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   5.000   5.087   7.000  17.000 
## 
## includes extended item information - examples:
##   labels
## 1    7up
## 2    bbq
## 3  bread

length(market)

## [1] 2000
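The figures in the summary are tied together by simple arithmetic. As a rough check, the sketch below recomputes the mean basket size and the density from the basket size distribution printed above; the variable names are chosen here for illustration.

```r
n_transactions <- 2000
n_items <- 43

# Basket size distribution copied from the summary output
sizes <- c(1:14, 16, 17)
counts <- c(124, 324, 275, 247, 224, 219, 173, 128, 107, 78, 57, 31, 9, 2, 1, 1)

total_items <- sum(sizes * counts)                   # items purchased across all baskets
mean_size <- total_items / n_transactions            # average number of items per basket
density <- total_items / (n_transactions * n_items)  # share of filled cells in the sparse matrix

print(c(total_items = total_items, mean_size = mean_size, density = density))
```

The recomputed mean and density agree with the values reported by summary(market), and the total also equals the sum of the "most frequent items" row.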

Item frequency plots

# limit the plot to 20 items
#absolute
itemFrequencyPlot(market, topN = 20, type="absolute", col ="purple", cex.names=.8, main="Item frequency - absolute")

#relative
itemFrequencyPlot(market, topN = 20, type="relative", col ="purple", cex.names=.8, main="Item frequency - relative")

Pizza, mars and coke are the most frequently purchased items (491, 486 and 482 of the 2000 baskets, respectively).
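The absolute and relative plots differ only by a constant factor. A short sketch, using the five counts reported by summary(market), shows the conversion:

```r
n_transactions <- 2000

# Absolute counts of the five most frequent items, copied from the summary output
top_counts <- c(pizza = 491, mars = 486, coke = 482, lasagna = 433, twix = 425)

# Relative frequency is the item's support: its share of all baskets
relative_freq <- top_counts / n_transactions
print(relative_freq)
```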

Apriori algorithm

Now we look for association rules using the Apriori algorithm. We set the minimum support to 0.015, so that every rule is backed by at least 30 transactions (0.015 * 2000), and the minimum confidence, i.e. the share of baskets containing product X that also contain product Y, to 65%. The minimum length of a rule is 2 items.
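As a quick illustration of how the two thresholds act on a single candidate rule, consider the sketch below. The candidate's counts (40 and 31) are hypothetical and only serve to show the arithmetic.

```r
n_transactions <- 2000
min_support <- 0.015
min_confidence <- 0.65

# Minimum number of baskets that must contain all items of a rule
min_count <- round(min_support * n_transactions)

# Hypothetical candidate rule: LHS seen in 40 baskets, LHS and RHS together in 31
lhs_count <- 40
rule_count <- 31

passes_support <- rule_count >= min_count
passes_confidence <- rule_count / lhs_count >= min_confidence

print(c(min_count = min_count, support_ok = passes_support, confidence_ok = passes_confidence))
```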

marketrules <- apriori(market, parameter = list(support = 0.015, confidence = 0.65, minlen = 2))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.65    0.1    1 none FALSE            TRUE       5   0.015      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 30 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[43 item(s), 2000 transaction(s)] done [0.00s].
## sorting and recoding items ... [42 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.06s].
## writing ... [35 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

marketrules

## set of 35 rules

summary(marketrules)

## set of 35 rules
## 
## rule length distribution (lhs + rhs):sizes
##  2  3 
##  1 34 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   3.000   2.971   3.000   3.000 
## 
## summary of quality measures:
##     support         confidence        coverage            lift      
##  Min.   :0.0150   Min.   :0.6522   Min.   :0.01900   Min.   :2.660  
##  1st Qu.:0.0160   1st Qu.:0.6638   1st Qu.:0.02350   1st Qu.:2.755  
##  Median :0.0175   Median :0.6852   Median :0.02450   Median :2.910  
##  Mean   :0.0196   Mean   :0.6989   Mean   :0.02829   Mean   :3.040  
##  3rd Qu.:0.0195   3rd Qu.:0.7205   3rd Qu.:0.02850   3rd Qu.:3.086  
##  Max.   :0.0805   Max.   :0.8421   Max.   :0.12200   Max.   :4.459  
##      count      
##  Min.   : 30.0  
##  1st Qu.: 32.0  
##  Median : 35.0  
##  Mean   : 39.2  
##  3rd Qu.: 39.0  
##  Max.   :161.0  
## 
## mining info:
##    data ntransactions support confidence
##  market          2000   0.015       0.65
##                                                                                      call
##  apriori(data = market, parameter = list(support = 0.015, confidence = 0.65, minlen = 2))

The algorithm found 35 rules: one with 2 items and 34 with 3 items.

Inspecting rules

# reorder the rules so that we are able to inspect the most meaningful ones
inspect(sort(marketrules, by = "confidence")[1:10])

##      lhs                          rhs     support confidence coverage lift    
## [1]  {peas, pepsi}             => {coke}  0.0160  0.8421053  0.0190   3.494213
## [2]  {chicken.tikka, potatoes} => {pizza} 0.0190  0.7916667  0.0240   3.224711
## [3]  {7up, milk}               => {coke}  0.0170  0.7906977  0.0215   3.280903
## [4]  {7up, potatoes}           => {coke}  0.0175  0.7777778  0.0225   3.227294
## [5]  {bulmers, lasagna}        => {pizza} 0.0230  0.7419355  0.0310   3.022140
## [6]  {newspaper, pepsi}        => {coke}  0.0155  0.7380952  0.0210   3.062636
## [7]  {bread, chicken.tikka}    => {pizza} 0.0180  0.7346939  0.0245   2.992643
## [8]  {pepsi, potatoes}         => {coke}  0.0205  0.7321429  0.0280   3.037937
## [9]  {fosters, twix}           => {mars}  0.0155  0.7209302  0.0215   2.966791
## [10] {pepsi, twix}             => {coke}  0.0180  0.7200000  0.0250   2.987552
##      count
## [1]  32   
## [2]  38   
## [3]  34   
## [4]  35   
## [5]  46   
## [6]  31   
## [7]  36   
## [8]  41   
## [9]  31   
## [10] 36

inspect(sort(marketrules, by = "lift")[1:10])

##      lhs                          rhs       support confidence coverage
## [1]  {lasagna, red.wine}       => {bulmers} 0.0160  0.6666667  0.0240  
## [2]  {bread, tea}              => {cheese}  0.0155  0.6595745  0.0235  
## [3]  {ham, mayonnaise}         => {cheese}  0.0160  0.6530612  0.0245  
## [4]  {instant.coffee, mars}    => {milk}    0.0195  0.6724138  0.0290  
## [5]  {peas, pepsi}             => {coke}    0.0160  0.8421053  0.0190  
## [6]  {7up, milk}               => {coke}    0.0170  0.7906977  0.0215  
## [7]  {7up, potatoes}           => {coke}    0.0175  0.7777778  0.0225  
## [8]  {chicken.tikka, potatoes} => {pizza}   0.0190  0.7916667  0.0240  
## [9]  {instant.coffee, pizza}   => {lasagna} 0.0175  0.6730769  0.0260  
## [10] {newspaper, pepsi}        => {coke}    0.0155  0.7380952  0.0210  
##      lift     count
## [1]  4.459309 32   
## [2]  4.071447 31   
## [3]  4.031242 32   
## [4]  3.664380 39   
## [5]  3.494213 32   
## [6]  3.280903 34   
## [7]  3.227294 35   
## [8]  3.224711 38   
## [9]  3.108900 35   
## [10] 3.062636 31

We see that 84.21% of baskets containing peas and pepsi also contain coke, and over 79% of baskets containing chicken.tikka and potatoes also contain pizza. Lift is highest (4.46) for the rule {lasagna, red.wine} => {bulmers}, followed by two rules with cheese on the right-hand side, {bread, tea} and {ham, mayonnaise}, which implies a strong association between these items.
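The quality measures in these tables can be recomputed from the counts alone. The sketch below does this for the rule {peas, pepsi} => {coke}, taking the rule count and coverage from the inspect() output and the coke count from summary(market); the variable names are illustrative.

```r
n_transactions <- 2000

rule_count <- 32                            # baskets with peas, pepsi and coke (count column)
lhs_count <- round(0.019 * n_transactions)  # baskets with peas and pepsi (coverage column)
coke_count <- 482                           # baskets with coke (from the summary output)

support <- rule_count / n_transactions
confidence <- rule_count / lhs_count
lift <- confidence / (coke_count / n_transactions)

print(c(support = support, confidence = confidence, lift = lift))
```

The results match the support, confidence and lift columns reported for this rule.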

Association rules graphs

Scatter-Plot

library(arulesViz)
library(plotly)
plot(marketrules, engine="plotly")

The above plot shows that the rules with the highest lift tend to have confidence close to the 0.65 threshold.
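One way to read this pattern is through the identity lift = confidence / support(RHS): a rule with modest confidence can still reach a high lift when its consequent is relatively rare. The sketch below applies this to the ten rules with the highest lift listed earlier. The data frame is typed in by hand from that output, and the column `rhs_support` is derived here, not reported by arules.

```r
# Ten rules with the highest lift, copied from the inspect() output
top_by_lift <- data.frame(
  rhs = c("bulmers", "cheese", "cheese", "milk", "coke",
          "coke", "coke", "pizza", "lasagna", "coke"),
  confidence = c(0.6666667, 0.6595745, 0.6530612, 0.6724138, 0.8421053,
                 0.7906977, 0.7777778, 0.7916667, 0.6730769, 0.7380952),
  lift = c(4.459309, 4.071447, 4.031242, 3.664380, 3.494213,
           3.280903, 3.227294, 3.224711, 3.108900, 3.062636)
)

# Confidence of the four rules with the highest lift
top4_conf <- top_by_lift$confidence[order(-top_by_lift$lift)][1:4]
print(range(top4_conf))

# Implied support of each consequent, from lift = confidence / support(rhs)
top_by_lift$rhs_support <- top_by_lift$confidence / top_by_lift$lift
print(top_by_lift)
```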

Two-key plot

plot(marketrules, method = "two-key plot", engine="plotly")

## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

The two-key plot additionally shows the number of items in each rule (the order). Almost all rules (34 of 35) contain 3 items.

Grouped matrix-based visualization

plot(marketrules, method="grouped", control=list(reorder=TRUE))

## Available control parameters (with default values):
## k     =  20
## aggr.fun  =  function (x, ...)  UseMethod("mean")
## rhs_max   =  10
## lhs_label_items   =  2
## col   =  c("#EE0000FF", "#EEEEEEFF")
## groups    =  NULL
## engine    =  ggplot2
## verbose   =  FALSE

The above balloon plot shows the antecedent groups (LHS) as columns and the consequents (RHS) as rows. The group containing the most interesting rules according to lift is shown in the leftmost column. That group contains 1 rule with a single consequent: bulmers.

library(arulesViz)

# the graph method does not accept a `main` control parameter, so no title is passed
plot(marketrules, method="graph", measure="support", shading="lift")

The above graph shows the discovered rules as a network of products. The arrows show the relations between products. Given measure="support" and shading="lift", the size of a vertex represents support and its color represents lift.

Parallel Coordinates Plot

marketrules2<-head(marketrules, n=10, by="lift")
plot(marketrules2, method="paracoord", control=list(reorder=TRUE))

The positions on the x-axis refer to the items on the LHS: 2 is the most recent addition to the basket and 1 is the item the customer already had. The parallel coordinates plot indicates that people who buy mayonnaise and ham are also likely to buy cheese.

Finding rules related to specific items

Now we check what drives people to buy pizza and what else people buy if they already have pizza in their basket.

What drives people to buy pizza?

# rules with pizza on the right-hand side
rules.pizza<-apriori(data=market, parameter=list(supp=0.01,conf = 0.005), 
                     appearance=list(default="lhs", rhs="pizza"), control=list(verbose=FALSE)) 
rules.pizza.byconf<-sort(rules.pizza, by="confidence", decreasing=TRUE)
inspect(head(rules.pizza.byconf))

##     lhs                                   rhs     support confidence coverage
## [1] {milk, soup}                       => {pizza} 0.0110  0.8148148  0.0135  
## [2] {chicken.tikka, lasagna, potatoes} => {pizza} 0.0100  0.8000000  0.0125  
## [3] {bread, cheese, chicken.tikka}     => {pizza} 0.0115  0.7931034  0.0145  
## [4] {chicken.tikka, potatoes}          => {pizza} 0.0190  0.7916667  0.0240  
## [5] {carrots, chicken.tikka}           => {pizza} 0.0115  0.7666667  0.0150  
## [6] {chicken.tikka, instant.coffee}    => {pizza} 0.0115  0.7666667  0.0150  
##     lift     count
## [1] 3.319001 22   
## [2] 3.258656 20   
## [3] 3.230564 23   
## [4] 3.224711 38   
## [5] 3.122878 23   
## [6] 3.122878 23

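The rules with the highest confidence for pizza have {milk, soup} (81.5%) and {chicken.tikka, lasagna, potatoes} (80%) on the left-hand side, although each of them is backed by only about 20 baskets. Note that the rules describe co-occurrence within a basket, not the order of purchase.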

What else will consumers buy if they have pizza in their basket?

# rules with pizza on the left-hand side
rules.pizzaopp<-apriori(data=market, parameter=list(supp=0.01,conf = 0.005), 
                        appearance=list(default="rhs", lhs="pizza"), control=list(verbose=FALSE)) 
rules.pizzaopp.byconf<-sort(rules.pizzaopp, by="confidence", decreasing=TRUE)
inspect(head(rules.pizzaopp.byconf))

##     lhs        rhs             support confidence coverage lift     count
## [1] {pizza} => {lasagna}       0.1370  0.5580448  0.2455   2.577574 274  
## [2] {pizza} => {chicken.tikka} 0.1015  0.4134420  0.2455   2.552111 203  
## [3] {}      => {mars}          0.2430  0.2430000  1.0000   1.000000 486  
## [4] {}      => {coke}          0.2410  0.2410000  1.0000   1.000000 482  
## [5] {}      => {lasagna}       0.2165  0.2165000  1.0000   1.000000 433  
## [6] {}      => {twix}          0.2125  0.2125000  1.0000   1.000000 425

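People who bought pizza are also likely to buy lasagna (confidence 55.8%) and chicken.tikka (41.3%). The remaining rows have an empty LHS: they only report each item's overall frequency in the data (lift 1) and say nothing about pizza.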
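The rules with an empty LHS in the output above could be excluded by requiring at least two items per rule. The sketch below shows this with `minlen = 2` on a toy set of baskets invented for illustration; it assumes the arules package is installed and does nothing otherwise. The thresholds are chosen only to suit the toy data.

```r
# Toy baskets, invented for illustration only
baskets <- list(
  c("pizza", "lasagna"),
  c("pizza", "lasagna", "coke"),
  c("pizza", "coke"),
  c("mars", "coke"),
  c("mars")
)

empty_lhs_rules <- NA  # stays NA if arules is not installed

if (requireNamespace("arules", quietly = TRUE)) {
  library(arules)
  toy <- as(baskets, "transactions")

  # minlen = 2 requires at least one item on each side of the rule
  rules <- apriori(
    toy,
    parameter = list(supp = 0.2, conf = 0.1, minlen = 2),
    appearance = list(default = "rhs", lhs = "pizza"),
    control = list(verbose = FALSE)
  )

  # Count the rules whose LHS is empty; with minlen = 2 there should be none
  empty_lhs_rules <- sum(size(lhs(rules)) == 0)
  inspect(sort(rules, by = "confidence"))
}
```

Applied to the report's call, the same parameter would leave only the rules that actually have pizza on the LHS.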

Conclusions

Association rules are an extremely useful tool for studying patterns of behavior and can be applied well beyond market basket analysis. The results of the Apriori algorithm used in this report are easy to understand and interpret. Another advantage is that the algorithm works well with large data sets, making it possible to extract useful information that is usually hard to find when the data has many dimensions. For the analysed dataset, the discovered rules to some extent reflect customers' eating habits, and sometimes they also suggest an intention to prepare a specific meal.
