Basket analysis

Robert Sidorowski

Introduction

Using the required packages:

library(tidyverse)
library(stringr)
library(reshape2)
library(arules)
library(arulesViz)

Basket analysis is one of the basic tools for testing customer shopping preferences. Based on the results, you can arrange the products in the store or prepare promotions. You can also use it to prepare a recommendation algorithm.

I do not use a real data set in this work. Instead, I assume a shop with only 7 products: buns (labelled Bunn in the code), bananas, butter, milk, honey, jam and pears. I assume that 10 people come to the store and everyone buys what they need. Their purchases are as follows:

  1. Buns
  2. Buns, Butter, Milk
  3. Milk, Bananas
  4. Pears
  5. Butter, Bananas
  6. Buns, Butter
  7. Buns, Jam
  8. Buns, Pears
  9. Honey
  10. Honey, Jam
transactionslist <- list("Bunn",
                     c("Bunn", "Butter", "Milk"),
                     c("Milk","Banana"),
                     "Pear",
                     c("Butter","Banana"),
                     c("Bunn","Butter"),
                     c("Bunn","Jam"),
                     c("Bunn","Pear"),
                     "Honey",
                     c("Honey","Jam"))

Next, I convert the list to a transactions object and summarize it:

transobject <- as(transactionslist, "transactions")
summary(transobject)
## transactions as itemMatrix in sparse format with
##  10 rows (elements/itemsets/transactions) and
##  7 columns (items) and a density of 0.2571429 
## 
## most frequent items:
##    Bunn  Butter  Banana   Honey     Jam (Other) 
##       5       3       2       2       2       4 
## 
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 
## 3 6 1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.25    2.00    1.80    2.00    3.00 
## 
## includes extended item information - examples:
##   labels
## 1 Banana
## 2   Bunn
## 3 Butter

From the summary we can see that we have 10 transactions and 7 products in our store.

The second important thing about this summary is that it shows us which products are the most popular and shows the number of shopping baskets in which these products are found.

We can also see the number of products in the baskets and its distribution. The summary shows that we had three purchases of one product, six purchases of two products and one purchase of three products.
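The figures in the summary can be reproduced by hand. The following is a minimal base R sketch, not part of the original analysis; the names baskets, item_counts, size_dist and density are chosen here for illustration. It counts how many baskets contain each product, tabulates the basket sizes, and computes the density as the share of filled cells in the basket-by-product matrix.

```r
# Same ten baskets as in the article, rebuilt here so the block runs on its own
baskets <- list("Bunn",
                c("Bunn", "Butter", "Milk"),
                c("Milk", "Banana"),
                "Pear",
                c("Butter", "Banana"),
                c("Bunn", "Butter"),
                c("Bunn", "Jam"),
                c("Bunn", "Pear"),
                "Honey",
                c("Honey", "Jam"))

# Number of baskets containing each product, most popular first
item_counts <- sort(table(unlist(baskets)), decreasing = TRUE)

# Basket sizes and how often each size occurs
basket_sizes <- lengths(baskets)
size_dist <- table(basket_sizes)

# Density: filled cells divided by all cells (baskets x distinct products)
n_items <- length(unique(unlist(baskets)))
density <- sum(basket_sizes) / (length(baskets) * n_items)

print(item_counts)
print(size_dist)
print(density)
```

The density works out to 18 / 70, which matches the 0.2571429 reported by summary().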

hist(size(transobject), col='blue', n=10)

tab <- crossTable(transobject, measure = 'support', sort = TRUE)
print(tab, digits=2)
##        Bunn Butter Banana Honey Jam Milk Pear
## Bunn    0.5    0.2    0.0   0.0 0.1  0.1  0.1
## Butter  0.2    0.3    0.1   0.0 0.0  0.1  0.0
## Banana  0.0    0.1    0.2   0.0 0.0  0.1  0.0
## Honey   0.0    0.0    0.0   0.2 0.1  0.0  0.0
## Jam     0.1    0.0    0.0   0.1 0.2  0.0  0.0
## Milk    0.1    0.1    0.1   0.0 0.0  0.2  0.0
## Pear    0.1    0.0    0.0   0.0 0.0  0.0  0.2
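In the cross table, the diagonal holds the support of each single product and every other cell holds the support of a pair of products. A rough base R sketch of the same calculation is shown below; it assumes nothing beyond the ten baskets above, and the names incidence, co_counts and support_table are illustrative. The columns come out in alphabetical order rather than sorted by frequency.

```r
# Same ten baskets as in the article, rebuilt here so the block runs on its own
baskets <- list("Bunn",
                c("Bunn", "Butter", "Milk"),
                c("Milk", "Banana"),
                "Pear",
                c("Butter", "Banana"),
                c("Bunn", "Butter"),
                c("Bunn", "Jam"),
                c("Bunn", "Pear"),
                "Honey",
                c("Honey", "Jam"))

items <- sort(unique(unlist(baskets)))

# Logical incidence matrix: one row per basket, one column per product
incidence <- t(sapply(baskets, function(b) items %in% b))
colnames(incidence) <- items

# crossprod(X) is t(X) %*% X: cell (i, j) counts baskets holding both i and j
co_counts <- crossprod(incidence * 1)

# Dividing by the number of baskets turns counts into support
support_table <- co_counts / length(baskets)
print(round(support_table, 2))
```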
freq.items <- eclat(transobject, parameter=list(supp=0.01,maxlen=15))
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target   ext
##     FALSE    0.01      1     15 frequent itemsets FALSE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 0
## Warning in eclat(transobject, parameter = list(supp = 0.01, maxlen = 15)): You chose a very low absolute support count of 0. You might run out of memory! Increase minimum support.
## create itemset ... 
## set transactions ...[7 item(s), 10 transaction(s)] done [0.00s].
## sorting and recoding items ... [7 item(s)] done [0.00s].
## creating bit matrix ... [7 row(s), 10 column(s)] done [0.00s].
## writing  ... [16 set(s)] done [0.00s].
## Creating S4 object  ... done [0.00s].
inspect(freq.items)
##      items              support count
## [1]  {Honey,Jam}        0.1     1    
## [2]  {Bunn,Pear}        0.1     1    
## [3]  {Bunn,Jam}         0.1     1    
## [4]  {Banana,Butter}    0.1     1    
## [5]  {Banana,Milk}      0.1     1    
## [6]  {Bunn,Butter,Milk} 0.1     1    
## [7]  {Bunn,Milk}        0.1     1    
## [8]  {Butter,Milk}      0.1     1    
## [9]  {Bunn,Butter}      0.2     2    
## [10] {Bunn}             0.5     5    
## [11] {Butter}           0.3     3    
## [12] {Milk}             0.2     2    
## [13] {Banana}           0.2     2    
## [14] {Jam}              0.2     2    
## [15] {Pear}             0.2     2    
## [16] {Honey}            0.2     2
freq.items.sorted<-sort(freq.items, by="support", decreasing=TRUE)
inspect(freq.items.sorted)
##      items              support count
## [1]  {Bunn}             0.5     5    
## [2]  {Butter}           0.3     3    
## [3]  {Bunn,Butter}      0.2     2    
## [4]  {Milk}             0.2     2    
## [5]  {Banana}           0.2     2    
## [6]  {Jam}              0.2     2    
## [7]  {Pear}             0.2     2    
## [8]  {Honey}            0.2     2    
## [9]  {Honey,Jam}        0.1     1    
## [10] {Bunn,Pear}        0.1     1    
## [11] {Bunn,Jam}         0.1     1    
## [12] {Banana,Butter}    0.1     1    
## [13] {Banana,Milk}      0.1     1    
## [14] {Bunn,Butter,Milk} 0.1     1    
## [15] {Bunn,Milk}        0.1     1    
## [16] {Butter,Milk}      0.1     1

This is what the matrix looks like; it shows the basket-product relationship:

# The columns V1..V7 follow the item label order of the transactions object
# (alphabetical here: the summary lists Banana, Bunn, Butter first),
# so they are renamed in that order.
transobject@data %>%
  t() %>%
  data.matrix() %>%
  as.data.frame() %>%
  rename("Banana" = V1, "Bunn" = V2, "Butter" = V3, "Honey" = V4, "Jam" = V5, "Milk" = V6, "Pear" = V7) %>%
  mutate(trans = row_number()) %>%
  gather(key = key, value = val, -trans) %>%
  ggplot() +
  geom_tile(aes(key, trans, fill=val), color = "black") +
  scale_fill_manual(values = c("TRUE" = "darkgray", "FALSE" = "white")) +
  scale_y_continuous(breaks = 1:10, trans = "reverse") +
  labs(x = "Product", y = "Transaction number", fill = "Purchase?")
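Typing the column names by hand is fragile, because the columns of the item matrix follow the item label order of the transactions object and not the order in which the products first appear in the list. The sketch below derives the mapping instead. It assumes the labels are sorted alphabetically, which is what the summary output (Banana, Bunn, Butter as labels 1 to 3) suggests; name_map is a hypothetical helper name.

```r
# Same ten baskets as in the article, rebuilt here so the block runs on its own
baskets <- list("Bunn",
                c("Bunn", "Butter", "Milk"),
                c("Milk", "Banana"),
                "Pear",
                c("Butter", "Banana"),
                c("Bunn", "Butter"),
                c("Bunn", "Jam"),
                c("Bunn", "Pear"),
                "Honey",
                c("Honey", "Jam"))

# Assumption: item columns are ordered by sorted label
labels <- sort(unique(unlist(baskets)))

# Lookup from the default data frame column names (V1, V2, ...) to products
name_map <- setNames(labels, paste0("V", seq_along(labels)))
print(name_map)
```

With arules loaded, itemLabels(transobject) should return the same labels directly from the transactions object, which avoids relying on the sorting assumption.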

Below is a simplified version of the chart above. It is simplified because both axes are rescaled to the range 0 to 1 and carry no product names or transaction numbers.

image(data.matrix(t(transobject@data)), col=c("white", "black"))

itemFrequencyPlot(transobject)

From this chart, we can see that buns were bought most often: they appear in 5 of the 10 baskets, which the plot shows as a relative frequency of 0.5. Butter is the second most frequently bought product, with 3 baskets. Each of the remaining products appears in 2 baskets.

itemsets <- apriori(transobject,
                    parameter = list(supp = 0.01,
                                     conf = 0.01,
                                     target = "rules"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.01    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 0 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[7 item(s), 10 transaction(s)] done [0.00s].
## sorting and recoding items ... [7 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [26 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
inspect(itemsets)
##      lhs              rhs      support confidence lift     count
## [1]  {}            => {Pear}   0.2     0.2000000  1.000000 2    
## [2]  {}            => {Honey}  0.2     0.2000000  1.000000 2    
## [3]  {}            => {Banana} 0.2     0.2000000  1.000000 2    
## [4]  {}            => {Jam}    0.2     0.2000000  1.000000 2    
## [5]  {}            => {Milk}   0.2     0.2000000  1.000000 2    
## [6]  {}            => {Butter} 0.3     0.3000000  1.000000 3    
## [7]  {}            => {Bunn}   0.5     0.5000000  1.000000 5    
## [8]  {Pear}        => {Bunn}   0.1     0.5000000  1.000000 1    
## [9]  {Bunn}        => {Pear}   0.1     0.2000000  1.000000 1    
## [10] {Honey}       => {Jam}    0.1     0.5000000  2.500000 1    
## [11] {Jam}         => {Honey}  0.1     0.5000000  2.500000 1    
## [12] {Banana}      => {Milk}   0.1     0.5000000  2.500000 1    
## [13] {Milk}        => {Banana} 0.1     0.5000000  2.500000 1    
## [14] {Banana}      => {Butter} 0.1     0.5000000  1.666667 1    
## [15] {Butter}      => {Banana} 0.1     0.3333333  1.666667 1    
## [16] {Jam}         => {Bunn}   0.1     0.5000000  1.000000 1    
## [17] {Bunn}        => {Jam}    0.1     0.2000000  1.000000 1    
## [18] {Milk}        => {Butter} 0.1     0.5000000  1.666667 1    
## [19] {Butter}      => {Milk}   0.1     0.3333333  1.666667 1    
## [20] {Milk}        => {Bunn}   0.1     0.5000000  1.000000 1    
## [21] {Bunn}        => {Milk}   0.1     0.2000000  1.000000 1    
## [22] {Butter}      => {Bunn}   0.2     0.6666667  1.333333 2    
## [23] {Bunn}        => {Butter} 0.2     0.4000000  1.333333 2    
## [24] {Butter,Milk} => {Bunn}   0.1     1.0000000  2.000000 1    
## [25] {Bunn,Milk}   => {Butter} 0.1     1.0000000  3.333333 1    
## [26] {Bunn,Butter} => {Milk}   0.1     0.5000000  2.500000 1

The algorithm produced 26 rules. Looking at them, we can see the following:

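Rules 1 to 7 have an empty lhs column. Such a rule has no antecedent, so its confidence is simply the support of the rhs product, i.e. the share of all baskets that contain it, and its lift is always 1. It does not mean that the basket contained only one product.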

The next thing we can see is that if someone bought honey, they may also have jam in their basket (rule 10). The support of this rule is 0.1, which means that 1 out of 10 transactions contains both products. The reverse rule (rule 11, jam then honey) has the same support, because support describes the pair of products and not the order in which they were put in the basket. Support and lift are therefore symmetric. Confidence is symmetric here only because honey and jam are equally frequent. In general it is not: rule 22 ({Butter} => {Bunn}) has a confidence of 0.67, while rule 23 ({Bunn} => {Butter}) has a confidence of 0.40.
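Support and confidence can be checked by hand on the ten baskets. The sketch below is a base R illustration only; support() and confidence() are helper functions defined here, not functions from arules. Support is the share of baskets containing all given products, and confidence is the support of lhs and rhs together divided by the support of lhs.

```r
# Same ten baskets as in the article, rebuilt here so the block runs on its own
baskets <- list("Bunn",
                c("Bunn", "Butter", "Milk"),
                c("Milk", "Banana"),
                "Pear",
                c("Butter", "Banana"),
                c("Bunn", "Butter"),
                c("Bunn", "Jam"),
                c("Bunn", "Pear"),
                "Honey",
                c("Honey", "Jam"))

# Share of baskets that contain every product in `items`
support <- function(items) {
  mean(sapply(baskets, function(b) all(items %in% b)))
}

# Share of baskets with `lhs` that also contain `rhs`
confidence <- function(lhs, rhs) {
  support(c(lhs, rhs)) / support(lhs)
}

print(support(c("Honey", "Jam")))        # support of the pair
print(confidence("Honey", "Jam"))        # honey => jam
print(confidence("Jam", "Honey"))        # jam => honey
print(confidence("Butter", "Bunn"))      # butter => buns
print(confidence("Bunn", "Butter"))      # buns => butter
print(confidence(character(0), "Bunn"))  # empty lhs: equals support of buns
```

The last call shows why a rule with an empty lhs has a confidence equal to the support of its rhs: every basket contains the empty set, so the denominator is 1.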

If we look at the confidence of rule 15 ({Butter} => {Banana}), for example, we can see that 33% of the customers who bought butter also bought bananas. Rule 19 shows the same figure for butter and milk.

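Looking at the lift column, we can see that rule 25 ({Bunn,Milk} => {Butter}) has the largest lift, 3.33. Lift is the confidence of a rule divided by the support of its rhs product. A lift above 1 means that customers who have the lhs products in their basket buy the rhs product more often than customers in general, while a lift of 1 means there is no association. With only 10 transactions, rule 25 rests on a single basket (count 1), so it should be read with caution.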
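The lift values in the table can be verified in the same hand-calculated way. This is a minimal sketch under the usual definition lift = confidence(lhs => rhs) / support(rhs); support(), confidence() and lift() are illustrative helpers written for this example, not arules functions.

```r
# Same ten baskets as in the article, rebuilt here so the block runs on its own
baskets <- list("Bunn",
                c("Bunn", "Butter", "Milk"),
                c("Milk", "Banana"),
                "Pear",
                c("Butter", "Banana"),
                c("Bunn", "Butter"),
                c("Bunn", "Jam"),
                c("Bunn", "Pear"),
                "Honey",
                c("Honey", "Jam"))

# Share of baskets that contain every product in `items`
support <- function(items) {
  mean(sapply(baskets, function(b) all(items %in% b)))
}

# Share of baskets with `lhs` that also contain `rhs`
confidence <- function(lhs, rhs) {
  support(c(lhs, rhs)) / support(lhs)
}

# How much more often `rhs` appears with `lhs` than it does overall
lift <- function(lhs, rhs) {
  confidence(lhs, rhs) / support(rhs)
}

print(lift(c("Bunn", "Milk"), "Butter"))  # rule 25
print(lift("Bunn", "Butter"))             # rule 23
print(lift("Pear", "Bunn"))               # rule 8
```

Rule 25 gives 1 / 0.3, rule 23 gives 0.4 / 0.3, and rule 8 gives 0.5 / 0.5, in line with the lift column printed by inspect().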