Market Basket Analysis

Introduction
- Market Basket Analysis Methodology
- Dataset description
Rules
- Whole milk rules
- Jaccard Index, Affinity measure.
Conclusions

Introduction

In this paper I apply Market Basket Analysis to real data from Kaggle: https://www.kaggle.com/heeraldedhia/groceries-dataset. The main goal of the paper is to examine this dataset’s characteristics.

Market Basket Analysis Methodology

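The goal of market basket analysis is to find a set of ‘strong’ rules that hold across the transactions in the data. A transaction can be understood as the set of items bought by a single person at a given moment. A rule has the form {A} => {B}, and its strength is usually judged by measures such as support, confidence and lift. Based on the rules found in the dataset, statements about purchasing behaviour can be formulated.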
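
As a minimal sketch of how rule strength is usually measured, the base R snippet below computes support, confidence and lift for one rule. The baskets and the helper function support() are made up for illustration and do not come from the groceries dataset.

```r
# Toy transactions (hypothetical data, not from the groceries dataset)
baskets <- list(
  c("whole milk", "yogurt"),
  c("whole milk", "rolls/buns"),
  c("yogurt", "rolls/buns", "whole milk"),
  c("soda"),
  c("yogurt", "soda")
)

# Share of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

lhs <- "yogurt"
rhs <- "whole milk"

supp_rule  <- support(c(lhs, rhs), baskets)       # P(lhs and rhs)
confidence <- supp_rule / support(lhs, baskets)   # P(rhs | lhs)
lift       <- confidence / support(rhs, baskets)  # confidence relative to P(rhs)

print(c(support = supp_rule, confidence = confidence, lift = lift))
```

A lift above 1 means the right-hand side appears more often in baskets containing the left-hand side than in baskets overall.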

Dataset description

The selected dataset contains 3 variables and 38765 observations. The key in the dataset is Member_number + Date; this combination will be called a transaction. The data is already grouped into categories such as ‘beef’, ‘sausage’ etc. That part of the process would normally be done by a data preparation team. Products can be grouped by name using a wide range of text mining tools, such as stemming and clustering, but for the purpose of this paper the data is already grouped.

library(dplyr); library(arules); library(arulesViz)  # packages used in the code below
df = read.csv('Groceries_dataset.csv', header=TRUE)
head(df, 10)

##    Member_number       Date  itemDescription
## 1           1808 21-07-2015   tropical fruit
## 2           2552 05-01-2015       whole milk
## 3           2300 19-09-2015        pip fruit
## 4           1187 12-12-2015 other vegetables
## 5           3037 01-02-2015       whole milk
## 6           4941 14-02-2015       rolls/buns
## 7           4501 08-05-2015 other vegetables
## 8           3803 23-12-2015       pot plants
## 9           2762 20-03-2015       whole milk
## 10          4119 12-02-2015   tropical fruit

There are in total 14963 transactions.
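
The count of transactions follows from the Member_number + Date key. A small sketch of that counting step in base R is shown here on a made-up purchase log (the toy data frame and its values are assumptions, only the column names match the dataset).

```r
# Toy purchase log with the same three columns as the dataset (values are made up)
toy <- data.frame(
  Member_number   = c(1001, 1001, 1001, 1002, 1002),
  Date            = c("01-01-2015", "01-01-2015", "05-02-2015",
                      "01-01-2015", "01-01-2015"),
  itemDescription = c("whole milk", "yogurt", "soda", "whole milk", "rolls/buns")
)

# One transaction = one distinct (Member_number, Date) pair
key <- paste(toy$Member_number, toy$Date)
n_transactions <- length(unique(key))

# Number of item rows per transaction
basket_size <- table(key)

print(n_transactions)
print(basket_size)
```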

xx = df %>% group_by(Member_number, Date) %>% count()
xx1 = xx %>% filter(n<200)
hist(xx1$n, main='Histogram of number of products in basket')

summary(xx1$n)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   2.000   2.591   3.000  11.000

The distribution is strongly right-skewed. One possible reading is a sequence of Bernoulli trials, in which each additional product bought by the customer counts as a success; the basket size would then follow a geometric-type distribution.
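
One way to make the Bernoulli reading concrete is a shifted geometric model. The sketch below is only an illustration under that assumption: it takes the minimum and mean basket size from the summary above, backs out the implied probability of adding one more item (the name p_add is hypothetical), and lists the basket-size shares such a model would predict. It does not test whether the model actually fits the data.

```r
# Summary values reported above for basket sizes
min_size  <- 2
mean_size <- 2.591

# Assumption: size = min_size + G, where G counts the extra items added before
# the customer stops, and each extra item is added with probability p_add
extra_mean <- mean_size - min_size          # E[G] = p_add / (1 - p_add)
p_add <- extra_mean / (1 + extra_mean)

# dgeom(g, prob) is P(g failures before the first success); in R's
# parameterisation the "success" is the customer stopping, so prob = 1 - p_add
sizes <- min_size:11
model_share <- dgeom(sizes - min_size, prob = 1 - p_add)

print(round(p_add, 3))
print(round(setNames(model_share, sizes), 3))
```

The predicted shares could be compared with the histogram of xx1$n to judge how reasonable the assumption is.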

# Item frequencies, most frequent first; show only items bought more than 500 times
xd1 = sort(table(df$itemDescription), decreasing=TRUE)
barplot(xd1[xd1>500], las = 2,
        cex.names = 0.3)

Rules

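To mine rules from the dataset, the data frame has to be transformed into a transactions object. Note that the code below splits the items by Member_number only, so each transaction here pools all purchases of one member (3898 baskets in total) rather than a single Member_number + Date visit.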

grouping_for_AA <- df %>%
                   group_by(Member_number,  itemDescription) %>%
                   dplyr::select(Member_number,  itemDescription, Date) %>%
                   data.frame()
trans <- as(split(grouping_for_AA[,"itemDescription"], grouping_for_AA[,"Member_number"]), "transactions")

## Warning in asMethod(object): removing duplicated items in transactions
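
The warning about duplicated items is a consequence of splitting by Member_number: a member who bought the same product on several dates contributes it more than once. The sketch below, on made-up data, contrasts per-member baskets with per-visit baskets keyed on Member_number + Date. The toy data frame is an assumption for illustration only.

```r
# Toy purchase log (made-up values): member 1001 buys whole milk on two dates
toy <- data.frame(
  Member_number   = c(1001, 1001, 1001, 1002),
  Date            = c("01-01-2015", "05-02-2015", "05-02-2015", "01-01-2015"),
  itemDescription = c("whole milk", "whole milk", "yogurt", "soda")
)

# Per-member baskets, as in the split above: "whole milk" appears twice for 1001
per_member <- split(toy$itemDescription, toy$Member_number)

# Per-visit baskets: split on the Member_number + Date key instead
per_visit <- split(toy$itemDescription, paste(toy$Member_number, toy$Date))

print(lengths(per_member))
print(anyDuplicated(per_member[["1001"]]) > 0)
print(length(per_visit))
# Either list could then be passed to as(..., "transactions") from arules
```

Which split is appropriate depends on the question: per-member baskets describe what a customer buys over time, per-visit baskets describe what is bought together in one trip.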

Now the frequent itemsets can be mined with the Eclat algorithm.

freq_items<-eclat(trans, parameter=list(supp=0.001, maxlen=15))

## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE   0.001      1     15 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 3 
## 
## create itemset ... 
## set transactions ...[167 item(s), 3898 transaction(s)] done [0.00s].
## sorting and recoding items ... [164 item(s)] done [0.00s].
## creating sparse bit matrix ... [164 row(s), 3898 column(s)] done [0.00s].
## writing  ... [217551 set(s)] done [0.12s].
## Creating S4 object  ... done [0.07s].

freq_rules<-ruleInduction(freq_items, trans, confidence=0.3)
summary(freq_rules)

## set of 533867 rules
## 
## rule length distribution (lhs + rhs):sizes
##      2      3      4      5      6      7      8 
##    866  30881 170025 226520  92390  12609    576 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   4.000   5.000   4.784   5.000   8.000 
## 
## summary of quality measures:
##     support           confidence          lift            itemset      
##  Min.   :0.001026   Min.   :0.3000   Min.   : 0.6548   Min.   :     1  
##  1st Qu.:0.001026   1st Qu.:0.4000   1st Qu.: 1.5873   1st Qu.: 67957  
##  Median :0.001283   Median :0.5000   Median : 2.0804   Median :123656  
##  Mean   :0.001654   Mean   :0.5591   Mean   : 2.4271   Mean   :119349  
##  3rd Qu.:0.001539   3rd Qu.:0.6667   3rd Qu.: 2.7883   3rd Qu.:174775  
##  Max.   :0.191380   Max.   :1.0000   Max.   :74.2476   Max.   :217387  
## 
## mining info:
##   data ntransactions support
##  trans          3898   0.001
##                                                              call confidence
##  eclat(data = trans, parameter = list(supp = 0.001, maxlen = 15))        0.3

inspect(head(sort(freq_rules, by ="lift"),10))

##      lhs                            rhs                          support confidence     lift itemset
## [1]  {chicken,                                                                                      
##       domestic eggs,                                                                                
##       rolls/buns,                                                                                   
##       soda,                                                                                         
##       whole milk}                => {cereals}                0.001026167  0.8000000 74.24762    2083
## [2]  {bottled water,                                                                                
##       long life bakery product,                                                                     
##       root vegetables,                                                                              
##       sausage}                   => {house keeping products} 0.001026167  0.6666667 57.74815    3829
## [3]  {bottled water,                                                                                
##       long life bakery product,                                                                     
##       root vegetables,                                                                              
##       whole milk}                => {house keeping products} 0.001026167  0.5000000 43.31111    3832
## [4]  {bottled beer,                                                                                 
##       bottled water,                                                                                
##       curd,                                                                                         
##       pork,                                                                                         
##       whole milk}                => {curd cheese}            0.001026167  0.5000000 42.36957    4549
## [5]  {chocolate,                                                                                    
##       pickled vegetables,                                                                           
##       pip fruit,                                                                                    
##       whole milk}                => {salt}                   0.001026167  0.8000000 35.03820   15664
## [6]  {chocolate,                                                                                    
##       pickled vegetables,                                                                           
##       pip fruit}                 => {salt}                   0.001026167  0.8000000 35.03820   15666
## [7]  {domestic eggs,                                                                                
##       hamburger meat,                                                                               
##       shopping bags,                                                                                
##       soda}                      => {canned fish}            0.001026167  1.0000000 33.89565   21715
## [8]  {chicken,                                                                                      
##       domestic eggs,                                                                                
##       rolls/buns,                                                                                   
##       soda}                      => {cereals}                0.001026167  0.3636364 33.74892    2085
## [9]  {chicken,                                                                                      
##       domestic eggs,                                                                                
##       rolls/buns,                                                                                   
##       whole milk}                => {cereals}                0.001026167  0.3636364 33.74892    2086
## [10] {cream cheese ,                                                                                
##       domestic eggs,                                                                                
##       fruit/vegetable juice,                                                                        
##       other vegetables}          => {photo/film}             0.001026167  0.6666667 33.74892   11541
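
The measures printed above can be turned back into absolute counts, which helps to judge how much evidence stands behind a rule. The sketch below uses only figures from the output (3898 transactions, the shared support value, and the confidence and lift of rule [1]); the variable names are illustrative.

```r
n_baskets <- 3898          # transactions reported in the mining info
supp      <- 0.001026167   # support shared by all ten rules above

# Support is a share of baskets, so the absolute count is support * n
rule_count <- round(supp * n_baskets)

# Rule [1]: confidence 0.8 means the left-hand side occurs in count / 0.8 baskets
lhs_count <- round(rule_count / 0.8)

# lift = confidence / support(rhs), so support(rhs) = confidence / lift
cereals_count <- round(0.8 / 74.24762 * n_baskets)

print(c(rule = rule_count, lhs = lhs_count, cereals = cereals_count))
```

Rules built on so few baskets can show a very high lift simply because the right-hand side item is rare, so they should be read with caution.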

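These are the top 10 rules by lift. Every one of them has a support of about 0.001, the minimum allowed, so the high lift values rest on very few baskets. Next, let's plot all of the rules found in the dataset.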

plot(freq_rules, measure=c("support", "confidence"), shading="lift")

## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

rules_for_plot = head(sort(sort(freq_rules, by ="confidence"),by="support"),10)
plot(rules_for_plot, method="paracoord")

The parallel coordinates plot shows, for each rule, which items on the left-hand side lead to which item on the right-hand side.

Whole milk rules

Whole milk is one of the products in the dataset; the rules below are restricted to those with whole milk on the right-hand side.

rules.wholemilk<-apriori(data=trans, parameter=list(supp=0.001, conf=0.08), appearance=list(default="lhs", rhs="whole milk"), control=list(verbose=FALSE))
rules.wholemilk<-sort(rules.wholemilk, by="confidence", decreasing=TRUE)
inspect(head(rules.wholemilk))

##     lhs                                  rhs          support     confidence
## [1] {whisky}                          => {whole milk} 0.002052335 1         
## [2] {frozen vegetables, hair spray}   => {whole milk} 0.001026167 1         
## [3] {frozen fruits, other vegetables} => {whole milk} 0.001026167 1         
## [4] {bottled water, whisky}           => {whole milk} 0.001282709 1         
## [5] {root vegetables, whisky}         => {whole milk} 0.001539251 1         
## [6] {rolls/buns, whisky}              => {whole milk} 0.001026167 1         
##     coverage    lift     count
## [1] 0.002052335 2.182531 8    
## [2] 0.001026167 2.182531 4    
## [3] 0.001026167 2.182531 4    
## [4] 0.001282709 2.182531 5    
## [5] 0.001539251 2.182531 6    
## [6] 0.001026167 2.182531 4
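
All six rules above share the same lift because they share the same confidence and the same right-hand side. A short sketch of that arithmetic, using only the printed values, also recovers how common whole milk is among the 3898 baskets.

```r
n_baskets  <- 3898
confidence <- 1          # all six rules above
lift       <- 2.182531   # all six rules above

# lift = confidence / support(rhs)  =>  support(rhs) = confidence / lift
supp_milk  <- confidence / lift
milk_count <- round(supp_milk * n_baskets)

# Rule counts from the table: each rule rests on this many baskets
counts <- c(8, 4, 4, 5, 6, 4)

print(round(supp_milk, 4))
print(milk_count)
print(range(counts))
```

Since whole milk appears in almost half of the baskets, a confidence of 1 over four to eight baskets is weak evidence on its own.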

We can keep only the statistically significant rules from the set mined above.

rules.wholemilk2 = rules.wholemilk[is.significant(rules.wholemilk, trans)]
inspect(rules.wholemilk2)

##     lhs                                       rhs          support   
## [1] {other vegetables, rolls/buns, yogurt} => {whole milk} 0.03437660
## [2] {bottled water, other vegetables}      => {whole milk} 0.05618266
## [3] {other vegetables, yogurt}             => {whole milk} 0.07183171
## [4] {rolls/buns, yogurt}                   => {whole milk} 0.06593125
## [5] {other vegetables, rolls/buns}         => {whole milk} 0.08209338
## [6] {yogurt}                               => {whole milk} 0.15059005
##     confidence coverage   lift     count
## [1] 0.6568627  0.05233453 1.433623 134  
## [2] 0.5983607  0.09389430 1.305941 219  
## [3] 0.5970149  0.12031811 1.303003 280  
## [4] 0.5921659  0.11133915 1.292420 257  
## [5] 0.5594406  0.14674192 1.220996 320  
## [6] 0.5321850  0.28296562 1.161510 587
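
To illustrate what a significance check on a rule involves, the sketch below rebuilds the 2x2 contingency table for rule [6] ({yogurt} => {whole milk}) from the printed measures and runs Fisher's exact test from base R. It assumes that is.significant relies on a test of this kind; the threshold and any multiple-comparison adjustment applied by the installed arules version are not reproduced here.

```r
n_baskets <- 3898
both      <- 587                                       # count of rule [6]
yogurt    <- round(0.28296562 * n_baskets)             # coverage * n
milk      <- round(0.5321850 / 1.161510 * n_baskets)   # confidence / lift * n

# 2x2 contingency table: rows = yogurt yes/no, columns = whole milk yes/no
tab <- matrix(
  c(both,        yogurt - both,
    milk - both, n_baskets - yogurt - milk + both),
  nrow = 2, byrow = TRUE,
  dimnames = list(yogurt = c("yes", "no"), whole_milk = c("yes", "no"))
)

# One-sided test: is whole milk over-represented in baskets with yogurt?
p_value <- fisher.test(tab, alternative = "greater")$p.value

print(tab)
print(p_value < 0.01)
```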

plot(rules.wholemilk2, method="graph")

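The graph shows how the items on the left-hand side of these rules connect to whole milk: each rule is drawn as a node that links its items to the consequent, with size and colour reflecting the rule's quality measures.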

Jaccard Index, Affinity measure.

For detailed definitions please visit:

Jaccard Index: https://en.wikipedia.org/wiki/Jaccard_index

Affinity Measure: https://en.wikipedia.org/wiki/Affinity_analysis

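# Keep only items that appear in more than 20% of the baskets
trans.sel<-trans[,itemFrequency(trans)>0.2]
# Pairwise Jaccard dissimilarity between items (the default method)
jac<-dissimilarity(trans.sel, which="items")
round(jac,digits=3)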

##                  bottled water other vegetables rolls/buns root vegetables
## other vegetables         0.811                                            
## rolls/buns               0.836            0.747                           
## root vegetables          0.861            0.817      0.814                
## sausage                  0.867            0.810      0.826           0.852
## soda                     0.831            0.781      0.780           0.824
## tropical fruit           0.856            0.824      0.822           0.859
## whole milk               0.799            0.703      0.716           0.803
## yogurt                   0.846            0.777      0.786           0.838
##                  sausage  soda tropical fruit whole milk
## other vegetables                                        
## rolls/buns                                              
## root vegetables                                         
## sausage                                                 
## soda               0.825                                
## tropical fruit     0.858 0.824                          
## whole milk         0.808 0.757          0.798           
## yogurt             0.818 0.805          0.828      0.745
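
The values above are dissimilarities, that is 1 minus the Jaccard index. A minimal base R sketch of the calculation for one pair of items is shown below; the incidence matrix and the helper jaccard_index() are made up for illustration.

```r
# Toy incidence matrix: rows are baskets, columns are items (made-up values)
m <- matrix(
  c(1, 1, 0,
    1, 0, 1,
    1, 1, 1,
    0, 0, 1,
    0, 1, 0),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("whole milk", "yogurt", "soda"))
)

# Jaccard index of two items: baskets with both / baskets with at least one
jaccard_index <- function(a, b) sum(a & b) / sum(a | b)

j <- jaccard_index(m[, "whole milk"], m[, "yogurt"])

print(j)       # similarity
print(1 - j)   # dissimilarity, the kind of quantity tabulated above
```

Lower dissimilarity means the two items share a larger part of their baskets; in the table above whole milk and other vegetables are the closest pair.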

a = affinity(trans.sel)
round(a, digits=3)

## An object of class "ar_similarity"
##                  bottled water other vegetables rolls/buns root vegetables
## bottled water            0.000            0.189      0.164           0.139
## other vegetables         0.189            0.000      0.253           0.184
## rolls/buns               0.164            0.253      0.000           0.186
## root vegetables          0.139            0.184      0.186           0.000
## sausage                  0.133            0.190      0.174           0.148
## soda                     0.169            0.219      0.220           0.176
## tropical fruit           0.144            0.176      0.178           0.141
## whole milk               0.201            0.297      0.284           0.197
## yogurt                   0.154            0.223      0.214           0.162
##                  sausage  soda tropical fruit whole milk yogurt
## bottled water      0.133 0.169          0.144      0.201  0.154
## other vegetables   0.190 0.219          0.176      0.297  0.223
## rolls/buns         0.174 0.220          0.178      0.284  0.214
## root vegetables    0.148 0.176          0.141      0.197  0.162
## sausage            0.000 0.175          0.142      0.192  0.182
## soda               0.175 0.000          0.176      0.243  0.195
## tropical fruit     0.142 0.176          0.000      0.202  0.172
## whole milk         0.192 0.243          0.202      0.000  0.255
## yogurt             0.182 0.195          0.172      0.255  0.000
## Slot "method":
## [1] "Affinity"
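
Comparing the two tables suggests that the affinity reported here is the Jaccard similarity, so that affinity and Jaccard dissimilarity add up to 1 for each pair. The sketch below checks this on a few pairs copied from the printed output; it is a consistency check on rounded values, not a statement about how the functions are implemented.

```r
# Pairs read off the two tables above: Jaccard dissimilarity and affinity
pairs <- data.frame(
  pair          = c("whole milk / yogurt", "whole milk / other vegetables",
                    "whole milk / rolls/buns", "soda / rolls/buns"),
  dissimilarity = c(0.745, 0.703, 0.716, 0.780),
  affinity      = c(0.255, 0.297, 0.284, 0.220)
)

# If affinity is the Jaccard similarity, the two columns should sum to 1
pairs$sum <- pairs$dissimilarity + pairs$affinity
print(pairs)
```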

Conclusions

Using Market Basket Analysis we can extract rules that describe which products customers tend to buy together. In our case we successfully extracted 6 statistically significant rules with whole milk as the consequent. A similar analysis may be conducted for all other products, for example to decide on the optimal layout of a grocery store: which products are often bought with other products and, consequently, which products should be placed near each other (within sight). This may encourage customers to buy another product.