Association rules are classified as unsupervised learning methods, in which we try to find relations or patterns among large sets of data items. One such method is "Market Basket Analysis" (MBA), which is built around "if-then" rules. In other words, this method can help predict the behaviour of a shopping customer by estimating which other products they are likely to select, based on the products already in their basket. For example, we might want to calculate the chance that a customer who has already put butter into their basket will also take bread.
In order to obtain meaningful results, several metrics have been highlighted that appear to be important. These are:

- support — the share of all transactions that contain both X and Y;
- confidence — the share of transactions containing X that also contain Y (an estimate of P(Y | X));
- lift — confidence divided by the overall share of transactions containing Y; values above 1 indicate that X and Y occur together more often than would be expected if they were independent.

By and large, the rules with high values of the above statistics are the most interesting ones.
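To make these definitions concrete, here is a toy illustration with made-up numbers (they are not taken from the dataset analysed below):
n_baskets      <- 100  # hypothetical: 100 baskets in total
n_butter       <- 30   # 30 baskets contain butter
n_bread        <- 60   # 60 baskets contain bread
n_butter_bread <- 20   # 20 baskets contain both butter and bread

support    <- n_butter_bread / n_baskets          # P(butter and bread)   = 0.20
confidence <- n_butter_bread / n_butter           # P(bread | butter)    ~= 0.67
lift       <- confidence / (n_bread / n_baskets)  # confidence / P(bread) ~= 1.11
round(c(support = support, confidence = confidence, lift = lift), 2)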
# Data handling
library(tidyverse)
library(kableExtra)
#library(plyr) - ddply() function for making the baskets and mapvalues() function for recoding the data
# Association rules
library(arules)
library(arulesViz)
grocery_org <- read.csv('Groceries_dataset.csv', sep = ",")
dim(grocery_org)
## [1] 38765 3
The dataset for this project comes from https://www.kaggle.com/heeraldedhia/groceries-dataset. Its dimensions are 38765 x 3. The variables are as follows:
kable(head(grocery_org)) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| Member_number | Date | itemDescription |
|---|---|---|
| 1808 | 21-07-2015 | tropical fruit |
| 2552 | 05-01-2015 | whole milk |
| 2300 | 19-09-2015 | pip fruit |
| 1187 | 12-12-2015 | other vegetables |
| 3037 | 01-02-2015 | whole milk |
| 4941 | 14-02-2015 | rolls/buns |
Despite the small number of variables, it is definitely worth delving into the dataset to look for some interesting dependencies.
summary(grocery_org)
## Member_number Date itemDescription
## Min. :1000 Length:38765 Length:38765
## 1st Qu.:2002 Class :character Class :character
## Median :3005 Mode :character Mode :character
## Mean :3004
## 3rd Qu.:4007
## Max. :5000
We see that summarising the data does not, at first glance, give much valuable information. Nevertheless, it is worth highlighting that the variable "Date" is stored as a character variable. It is advisable to change that.
# I don't want to affect my original data, so just in case I will make a copy
grocery <- as.data.frame(grocery_org)
grocery$Date <- as.Date(grocery$Date, "%d-%m-%y")
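# NOTE: "%d-%m-%y" expects a two-digit year, so a date such as "21-07-2015" is parsed
# as 2020-07-21 (the leading "20" is read as the year). If the original years (e.g. 2015)
# were needed, "%d-%m-%Y" might be the more appropriate format.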
class(grocery$Date) # Check
## [1] "Date"
Now, in order to better understand the data, I will split the Date variable into three new ones: Year, Month and Day.
grocery_final <- separate(grocery, "Date", c("Year", "Month", "Day"), sep = "-")
grocery_final$Date <- grocery$Date
kable(head(grocery_final)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| Member_number | Year | Month | Day | itemDescription | Date |
|---|---|---|---|---|---|
| 1808 | 2020 | 07 | 21 | tropical fruit | 2020-07-21 |
| 2552 | 2020 | 01 | 05 | whole milk | 2020-01-05 |
| 2300 | 2020 | 09 | 19 | pip fruit | 2020-09-19 |
| 1187 | 2020 | 12 | 12 | other vegetables | 2020-12-12 |
| 3037 | 2020 | 02 | 01 | whole milk | 2020-02-01 |
| 4941 | 2020 | 02 | 14 | rolls/buns | 2020-02-14 |
# By the way, let's see how many different dates I have in my analysis
print(paste("The number of unique years is equal to:", length(unique(grocery_final$Year))))
## [1] "The number of unique years is equal to: 1"
print(paste("The number of unique months is equal to:", length(unique(grocery_final$Month))))
## [1] "The number of unique months is equal to: 12"
print(paste("The number of unique days is equal to:", length(unique(grocery_final$Day))))
## [1] "The number of unique days is equal to: 31"
Let's look for NA values and for potential duplicates. We should be careful, though, because it may happen that one client bought two or more pieces of the same product on the same day.
# Look for NaN's
colSums(is.na(grocery_final))
## Member_number Year Month Day itemDescription
## 0 0 0 0 0
## Date
## 0
sum(duplicated(grocery_final))
## [1] 762
kable(head(grocery_final[duplicated(grocery_final), ])) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| | Member_number | Year | Month | Day | itemDescription | Date |
|---|---|---|---|---|---|---|
| 5016 | 2051 | 2020 | 09 | 11 | frankfurter | 2020-09-11 |
| 5023 | 3055 | 2020 | 08 | 18 | other vegetables | 2020-08-18 |
| 5045 | 1994 | 2020 | 03 | 11 | whole milk | 2020-03-11 |
| 5056 | 1682 | 2020 | 06 | 25 | pip fruit | 2020-06-25 |
| 5060 | 4324 | 2020 | 01 | 05 | sausage | 2020-01-05 |
| 5066 | 2694 | 2020 | 11 | 24 | whole milk | 2020-11-24 |
# The data may contain genuine duplicates. To check this, let's look at one client
kable(grocery_final[grocery_final$Member_number == 2051 & grocery_final$itemDescription == "other vegetables", ]) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| | Member_number | Year | Month | Day | itemDescription | Date |
|---|---|---|---|---|---|---|
| 2222 | 2051 | 2020 | 07 | 23 | other vegetables | 2020-07-23 |
| 18371 | 2051 | 2020 | 07 | 23 | other vegetables | 2020-07-23 |
| 24376 | 2051 | 2020 | 04 | 20 | other vegetables | 2020-04-20 |
| 29435 | 2051 | 2020 | 12 | 03 | other vegetables | 2020-12-03 |
As we can see, there are some duplicates. However, we shouldn't be worried, because we don't know exactly which vegetables these clients bought. That's why I decided not to remove any records (the MBA algorithm will later remove the duplicates on its own). Keeping in mind that there are no missing values, let's check which customer bought the most products.
# Checking the uniqueness of customers (one customer is like a one basket)
length(unique(grocery_final$Member_number))
## [1] 3898
# Because I have quite a lot of shopping carts, I will limit the overview to the 6 customers with the most purchases
head(sort(table(grocery_final$Member_number), decreasing = TRUE))
##
## 3180 2051 3050 3737 2271 2433
## 36 33 33 33 31 31
We can see that the record holders bought more than 30 products during the year. For these 6 customers, I decided to check how their purchases were distributed over time (by month). This will help me assess whether they were regular customers or one-off buyers.
# Note: facet_wrap() below arranges the six panels, so no par(mfrow = c(2, 3)) call is needed for ggplot output
grocery_final%>%
filter(Member_number %in% c(3180, 2051, 3050, 3737, 2271, 2433)) %>%
group_by(Member_number, Month) %>%
mutate(Occurencies_per_month = n()) %>%
select(Member_number, Month, Occurencies_per_month) %>%
ggplot(mapping = aes(x = Month,
y = Occurencies_per_month,
group = Member_number,
colour = factor(Member_number))) +
geom_line(show.legend = FALSE) +
ggtitle("Monthly distribution of the products bought") +
xlab("Months") +
ylab("Amount of bought products") +
theme_bw() +
facet_wrap(~Member_number, scales = "free",
labeller = as_labeller(c("2051" = "Customer ID: 2051",
"2271" = "Customer ID: 2271",
"2433" = "Customer ID: 2433",
"3050" = "Customer ID: 3050",
"3180" = "Customer ID: 3180",
"3737" = "Customer ID: 3737")))
As we can see in the charts above, none of the top clients are one-time customers, and their transactions are distributed unevenly throughout the year. One might also be tempted to conclude that the customers' purchasing decisions were independent of each other. Moreover, we cannot distinguish any common pattern, e.g. all customers making more purchases around Christmas (December).
NOTE
I am aware that my data does not come from a single point in time. However, for the purpose of practising the algorithm, I will assume that all purchases were made at the same time. So now, for each customer, I will calculate the total number of purchased products.
grocery_final <-
grocery_final%>%
group_by(Member_number) %>%
mutate(Occurencies = n())
basket_size_plot <- ggplot(data = grocery_final, mapping = aes(x = Occurencies, y = 1)) +
geom_violin(fill='#A4A4A4') +
ggtitle("Violin plot for the customers baskets") +
xlab("The size of the basket") +
ylab("")
basket_size_plot + geom_boxplot(width=0.1)
Now, having looked at the basket sizes, let me assess which products were bought most often.
products_countplot <- grocery_final %>%
group_by(itemDescription) %>%
summarise(counts = n()) %>%
top_n(10) %>%
ggplot(aes(x = reorder(itemDescription, -counts), y = counts)) +
geom_bar(fill = "mediumpurple3", stat = "identity") +
ggtitle("Countplot of ten the most popular products") +
xlab("Products") +
ylab("") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
products_countplot
Whole milk is the most frequently selected product. From the graph above, it can also be seen that vegetables are split across several product groups (e.g. "other vegetables", "root vegetables"), which may significantly affect further results.
Interesting fact
The arules package also provides an auxiliary plotting function for transaction data, itemFrequencyPlot(), which shows the most frequently purchased/selected items from the shopping baskets. The resulting barplot/countplot should be equivalent to what I have presented above.
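A minimal sketch of such a call (assuming the basket_df transactions object that is created later with read.transactions()):
# Plot the 10 most frequent items directly from the transactions object
itemFrequencyPlot(basket_df, topN = 10, type = "absolute")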
After analysing the data, let's transform it into a form that will enable me to apply the Association rules algorithm.
Now, I will sort the data by customer ID (the Member_number variable) and make sure the ID is stored in a numeric format.
converted_grocery <- grocery_final[order(grocery_final$Member_number),]
converted_grocery$Member_number <- as.numeric(converted_grocery$Member_number)
glimpse(converted_grocery)
## Rows: 38,765
## Columns: 7
## Groups: Member_number [3,898]
## $ Member_number <dbl> 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, ~
## $ Year <chr> "2020", "2020", "2020", "2020", "2020", "2020", "2020"~
## $ Month <chr> "05", "07", "03", "11", "03", "06", "05", "07", "03", ~
## $ Day <chr> "27", "24", "15", "25", "15", "24", "27", "24", "15", ~
## $ itemDescription <chr> "soda", "canned beer", "sausage", "sausage", "whole mi~
## $ Date <date> 2020-05-27, 2020-07-24, 2020-03-15, 2020-11-25, 2020-~
## $ Occurencies <int> 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 12~
After that, I will group all the items that were bought by each customer on the same date, in order to make some baskets.
items <- plyr::ddply(converted_grocery, c("Member_number","Date"), function(temp_df)paste(temp_df$itemDescription,collapse = ","))
items <- subset(items, select = V1)
head(items)
## V1
## 1 sausage,whole milk,semi-finished bread,yogurt
## 2 soda,pickled vegetables
## 3 whole milk,pastry,salty snack
## 4 canned beer,misc. beverages
## 5 sausage,hygiene articles
## 6 frankfurter,soda,whipped/sour cream
Finally, I save the new data to a CSV file and then read it back with the appropriate function, arules::read.transactions().
write.csv(items, file = "transactions.csv", quote = FALSE, row.names = T)
basket_df <- arules::read.transactions("transactions.csv", format = "basket", sep = ",", cols = 1)
## Warning in asMethod(object): removing duplicated items in transactions
print(basket_df)
## transactions in sparse format with
## 14935 transactions (rows) and
## 168 items (columns)
At this point, I deliberately did not turn off warning messages, because when the baskets are created we get an important message telling us that the function removes duplicated items on its own. This is consistent with what I mentioned earlier in the exploratory part.
With the data prepared and cleaned, I will now proceed to the MBA method itself. I will start with the apriori() algorithm, which seems to be the most helpful here. In general, we are interested in rules of the form X => Y, for which we can calculate the three measures mentioned earlier: support, confidence and lift. The support and confidence values passed to apriori() act as minimum thresholds for the rules that are kept.
basket_apriori <- apriori(basket_df, parameter = list(sup = 0.001, conf = 0.05, minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.05 0.1 1 none FALSE TRUE 5 0.001 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 14
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[168 item(s), 14935 transaction(s)] done [0.00s].
## sorting and recoding items ... [149 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [452 rule(s)] done [0.19s].
## creating S4 object ... done [0.00s].
summary(basket_apriori)
## set of 452 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 425 27
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 2.00 2.00 2.06 2.00 3.00
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001004 Min. :0.05000 Min. :0.005357 Min. :0.5185
## 1st Qu.:0.001272 1st Qu.:0.06394 1st Qu.:0.016003 1st Qu.:0.7662
## Median :0.001942 Median :0.08102 Median :0.023636 Median :0.8337
## Mean :0.002768 Mean :0.08775 Mean :0.033714 Mean :0.8873
## 3rd Qu.:0.003365 3rd Qu.:0.10492 3rd Qu.:0.043790 3rd Qu.:0.9621
## Max. :0.014931 Max. :0.25581 Max. :0.158219 Max. :2.1659
## count
## Min. : 15.00
## 1st Qu.: 19.00
## Median : 29.00
## Mean : 41.34
## 3rd Qu.: 50.25
## Max. :223.00
##
## mining info:
## data ntransactions support confidence
## basket_df 14935 0.001 0.05
## call
## apriori(data = basket_df, parameter = list(sup = 0.001, conf = 0.05, minlen = 2))
inspect(basket_apriori[1:5])
## lhs rhs support confidence
## [1] {frozen fish} => {whole milk} 0.001071309 0.1568627
## [2] {seasonal products} => {rolls/buns} 0.001004352 0.1415094
## [3] {pot plants} => {other vegetables} 0.001004352 0.1282051
## [4] {pot plants} => {whole milk} 0.001004352 0.1282051
## [5] {pasta} => {whole milk} 0.001071309 0.1322314
## coverage lift count
## [1] 0.006829595 0.9914283 16
## [2] 0.007097422 1.2839875 15
## [3] 0.007833947 1.0480260 15
## [4] 0.007833947 0.8103020 15
## [5] 0.008101774 0.8357495 16
inspect(head(sort(basket_apriori, by = "confidence", decreasing = TRUE)))
## lhs rhs support confidence coverage
## [1] {sausage, yogurt} => {whole milk} 0.001473050 0.2558140 0.005758286
## [2] {rolls/buns, sausage} => {whole milk} 0.001138266 0.2125000 0.005356545
## [3] {sausage, soda} => {whole milk} 0.001071309 0.1797753 0.005959156
## [4] {semi-finished bread} => {whole milk} 0.001673920 0.1760563 0.009507867
## [5] {rolls/buns, yogurt} => {whole milk} 0.001339136 0.1694915 0.007900904
## [6] {sausage, whole milk} => {yogurt} 0.001473050 0.1641791 0.008972213
## lift count
## [1] 1.616835 22
## [2] 1.343076 17
## [3] 1.136244 16
## [4] 1.112739 25
## [5] 1.071247 20
## [6] 1.908183 22
Based on the table above (the confidence column), we see that roughly one in four customers who bought sausage and yogurt also bought whole milk. What's more, it turns out that whole milk is the consequent of the five most confident rules.
I will now present the obtained results graphically in several different views. Despite the variety of methods, the information contained in the graphs should tell the same story.
plot(basket_apriori[1:20], method="matrix", measure="lift")
## Itemsets in Antecedent (LHS)
## [1] "{packaged fruit/vegetables}" "{seasonal products}"
## [3] "{herbs}" "{processed cheese}"
## [5] "{flour}" "{detergent}"
## [7] "{red/blush wine}" "{frozen fish}"
## [9] "{semi-finished bread}" "{pot plants}"
## [11] "{pasta}" "{pickled vegetables}"
## Itemsets in Consequent (RHS)
## [1] "{whole milk}" "{other vegetables}" "{rolls/buns}"
## [4] "{yogurt}" "{root vegetables}" "{tropical fruit}"
plot(basket_apriori[1:20], method="grouped")
plot(basket_apriori[1:20], method="graph")
plot(basket_apriori[1:20], method="paracoord", control=list(reorder=TRUE))
Besides the basic measures such as support, confidence and lift, which are recognised as the hallmarks of Association Rules, many more can be included in the analysis. One of them is certainly the Jaccard index.
Without going into the mathematical formula, this index measures how often two items (in my case, products) are bought together, relative to how often at least one of them is bought. The dissimilarity() function returns the Jaccard distance, i.e. one minus the index, as a matrix between the products. The closer a value is to 1, the lower the chance that the two products appear in the same basket.
basket_index <-basket_df[, itemFrequency(basket_df)>0.05]
jac_index<-dissimilarity(basket_index, which="items", method = "jaccard")
round(jac_index, 2)
## bottled water citrus fruit other vegetables pastry rolls/buns
## citrus fruit 0.98
## other vegetables 0.97 0.97
## pastry 0.97 0.98 0.98
## rolls/buns 0.97 0.97 0.95 0.98
## root vegetables 0.97 0.98 0.97 0.98 0.97
## sausage 0.97 0.99 0.97 0.97 0.97
## soda 0.97 0.97 0.95 0.97 0.96
## tropical fruit 0.97 0.98 0.97 0.98 0.96
## whole milk 0.97 0.96 0.94 0.97 0.94
## yogurt 0.97 0.96 0.96 0.97 0.96
## root vegetables sausage soda tropical fruit whole milk
## citrus fruit
## other vegetables
## pastry
## rolls/buns
## root vegetables
## sausage 0.97
## soda 0.97 0.96
## tropical fruit 0.97 0.98 0.97
## whole milk 0.97 0.96 0.95 0.96
## yogurt 0.97 0.96 0.97 0.96 0.95
As we can see, all values in this matrix are above 0.94, i.e. the Jaccard distances are very high. We can therefore conclude that there is only a small chance of observing even these popular products together in the same basket.
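As a sanity check, a single entry of this matrix could be reproduced by hand. A minimal sketch (the coercion to a plain matrix and the choice of the "whole milk" / "rolls/buns" pair are my own illustration, not part of the original workflow):
# Convert the transactions to a logical item matrix (rows = baskets, columns = items)
item_mat <- as(basket_df, "matrix")
a <- item_mat[, "whole milk"]
b <- item_mat[, "rolls/buns"]
jaccard_similarity <- sum(a & b) / sum(a | b)  # |A intersect B| / |A union B|
1 - jaccard_similarity                         # Jaccard distance, as reported by dissimilarity()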
Since I already have a dissimilarity matrix, I will also plot a dendrogram based on it.
plot(hclust(jac_index, method="ward.D2"), main="Dendrogram for trans")
From the analysis above, I know that whole milk turned out to be the most frequently selected product. Now, I will check what else customers had in their baskets when they picked whole milk, i.e. I will restrict the right-hand side of the rules to whole milk.
whole_milk_apriori <- apriori(basket_df, parameter = list(supp=0.001, conf=0.15),appearance = list(default="lhs",rhs="whole milk"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.15 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 14
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[168 item(s), 14935 transaction(s)] done [0.01s].
## sorting and recoding items ... [149 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [10 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(whole_milk_apriori, linebreak = FALSE)
## lhs rhs support confidence coverage
## [1] {} => {whole milk} 0.158218949 0.1582189 1.000000000
## [2] {frozen fish} => {whole milk} 0.001071309 0.1568627 0.006829595
## [3] {detergent} => {whole milk} 0.001406093 0.1627907 0.008637429
## [4] {semi-finished bread} => {whole milk} 0.001673920 0.1760563 0.009507867
## [5] {ham} => {whole milk} 0.002745229 0.1601562 0.017140944
## [6] {bottled beer} => {whole milk} 0.007164379 0.1578171 0.045396719
## [7] {sausage, yogurt} => {whole milk} 0.001473050 0.2558140 0.005758286
## [8] {sausage, soda} => {whole milk} 0.001071309 0.1797753 0.005959156
## [9] {rolls/buns, sausage} => {whole milk} 0.001138266 0.2125000 0.005356545
## [10] {rolls/buns, yogurt} => {whole milk} 0.001339136 0.1694915 0.007900904
## lift count
## [1] 1.0000000 2363
## [2] 0.9914283 16
## [3] 1.0288951 21
## [4] 1.1127386 25
## [5] 1.0122444 41
## [6] 0.9974602 107
## [7] 1.6168351 22
## [8] 1.1362437 16
## [9] 1.3430755 17
## [10] 1.0712467 20
I deliberately increased the confidence threshold in order to focus only on those products which might be crucial to the analysis. The first rule, with an empty left-hand side, may look strange at first, but it simply reflects the fact that whole milk appears in about 15.8% of all baskets (the minlen parameter was left at its default of 1 here). Nevertheless, I think that whole milk was usually chosen quite randomly and not, e.g., as part of a given recipe for some cake. So I will check how the situation looks for another type of dairy product, yogurt. If the results turn out to be similar, I may suspect that the customers of my "store" were very diverse in their habits.
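If one wanted to drop that trivial empty-LHS rule, a minimal sketch using standard arules indexing (the object name non_trivial_rules is mine):
# Keep only rules whose left-hand side contains at least one item
non_trivial_rules <- whole_milk_apriori[size(lhs(whole_milk_apriori)) > 0]
inspect(head(sort(non_trivial_rules, by = "confidence")))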
yogurt_apriori <- apriori(basket_df, parameter = list(supp=0.001, conf=0.10), appearance = list(default="lhs",rhs="yogurt"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.1 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 14
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[168 item(s), 14935 transaction(s)] done [0.00s].
## sorting and recoding items ... [149 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [5 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(yogurt_apriori, linebreak = FALSE)
## lhs rhs support confidence coverage
## [1] {detergent} => {yogurt} 0.001071309 0.1240310 0.008637429
## [2] {herbs} => {yogurt} 0.001138266 0.1075949 0.010579176
## [3] {soft cheese} => {yogurt} 0.001272179 0.1266667 0.010043522
## [4] {chewing gum} => {yogurt} 0.001406093 0.1166667 0.012052226
## [5] {sausage, whole milk} => {yogurt} 0.001473050 0.1641791 0.008972213
## lift count
## [1] 1.441559 16
## [2] 1.250529 17
## [3] 1.472192 19
## [4] 1.355966 21
## [5] 1.908183 22
Unfortunately, yogurt was found in far fewer baskets than milk, so I decided to lower the confidence threshold to 0.10. It turns out that for this product too, it is difficult to rationally explain what guided customers who combined, e.g., yogurt with detergent. What is more, given the low confidence values for all five rules above, it is hard to find any strong relationship between the products.
To be honest, I am not quite happy with the results obtained so far, so I will use a small trick and recode some of the products by changing their names to more generic equivalents. Similar products will be grouped into one category, which should increase the confidence values.
# Old names
names.real <- c("whole milk", "cream cheese ", "yogurt", "root vegetables", "frozen vegetables", "other vegetables", "ham", "beef", "sausage")
# New names
names.new <-c("diary", "diary", "diary", "vegetables", "vegetables", "vegetables", "meat", "meat", "meat")
# Recode the products
data_fe <- as.data.frame(converted_grocery)
data_fe$itemDescription <- plyr::mapvalues(data_fe$itemDescription, names.real, names.new)
# Check
head(data_fe[data_fe$itemDescription %in% names.new, ])
## Member_number Year Month Day itemDescription Date Occurencies
## 3 1000 2020 03 15 meat 2020-03-15 13
## 4 1000 2020 11 25 meat 2020-11-25 13
## 5 1000 2020 03 15 diary 2020-03-15 13
## 6 1000 2020 06 24 diary 2020-06-24 13
## 11 1000 2020 03 15 diary 2020-03-15 13
## 16 1001 2020 04 14 meat 2020-04-14 12
As before, I will create some baskets.
items_fe <- plyr::ddply(data_fe, c("Member_number","Date"), function(temp_df)paste(temp_df$itemDescription,collapse = ","))
items_fe <- subset(items_fe, select = V1)
write.csv(items_fe, file = "transactions_fe.csv", quote = FALSE, row.names = T)
basket_df_fe <- arules::read.transactions("transactions_fe.csv", format = "basket", sep = ",", cols = 1)
print(basket_df_fe) # The number of rows hasn't changed, but the number of columns has decreased by 7.
## transactions in sparse format with
## 14935 transactions (rows) and
## 161 items (columns)
basket_apriori_fe <- apriori(basket_df_fe, parameter = list(sup = 0.001, conf = 0.05, minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.05 0.1 1 none FALSE TRUE 5 0.001 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 14
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[161 item(s), 14935 transaction(s)] done [0.00s].
## sorting and recoding items ... [142 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [482 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(basket_apriori_fe)
## set of 482 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 374 108
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 2.000 2.224 2.000 3.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001004 Min. :0.05000 Min. :0.002946 Min. :0.4994
## 1st Qu.:0.001272 1st Qu.:0.08146 1st Qu.:0.010194 1st Qu.:0.7409
## Median :0.001875 Median :0.12320 Median :0.017944 Median :0.8173
## Mean :0.003317 Mean :0.13410 Mean :0.028575 Mean :0.8583
## 3rd Qu.:0.003415 3rd Qu.:0.18154 3rd Qu.:0.034985 3rd Qu.:0.9306
## Max. :0.040911 Max. :0.34091 Max. :0.252293 Max. :1.7095
## count
## Min. : 15.00
## 1st Qu.: 19.00
## Median : 28.00
## Mean : 49.54
## 3rd Qu.: 51.00
## Max. :611.00
##
## mining info:
## data ntransactions support confidence
## basket_df_fe 14935 0.001 0.05
## call
## apriori(data = basket_df_fe, parameter = list(sup = 0.001, conf = 0.05, minlen = 2))
inspect(head(sort(basket_apriori_fe, by = "confidence", decreasing = TRUE)))
## lhs rhs support confidence coverage
## [1] {frankfurter, meat} => {vegetables} 0.001004352 0.3409091 0.002946100
## [2] {meat, pork} => {diary} 0.001004352 0.3191489 0.003146970
## [3] {finished products} => {diary} 0.001205223 0.2812500 0.004285236
## [4] {citrus fruit, meat} => {diary} 0.001205223 0.2769231 0.004352193
## [5] {meat, shopping bags} => {diary} 0.001138266 0.2741935 0.004151322
## [6] {meat, pastry} => {diary} 0.001540007 0.2738095 0.005624372
## lift count
## [1] 1.625631 15
## [2] 1.264992 15
## [3] 1.114774 18
## [4] 1.097624 18
## [5] 1.086805 17
## [6] 1.085283 23
Analysing the above results, we can see that I managed to artificially increase the confidence values, by close to 10 percentage points for the strongest rule, just by renaming 9 products. Compared to the first run of the algorithm, the top rules have also changed, and in many of them meat products appear for the first time. Therefore, I think that extending the meat group with further meat products (for example frankfurter and pork, which still appear as separate items in the rules above) would improve the results even more.
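A minimal sketch of that extension (the extra items are only examples taken from the rules above; the baskets would then be rebuilt exactly as before):
# Extend the recoding with further meat items that still appear on their own
names.real.ext <- c(names.real, "frankfurter", "pork")
names.new.ext  <- c(names.new,  "meat",        "meat")
data_fe_ext <- as.data.frame(converted_grocery)
data_fe_ext$itemDescription <- plyr::mapvalues(data_fe_ext$itemDescription, names.real.ext, names.new.ext)
# ...then rebuild the baskets with plyr::ddply() and read.transactions() as above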
In the analysis above, the Association rules algorithm was used to determine which product customers would be inclined to choose next, based on their current basket of goods. Several conclusions can be drawn from it. The main one is that the algorithm can tell us relatively little when there are very many possible items (in my case, food products) and purchases are spread thinly across them.
Therefore, it seems that for a supermarket manager this algorithm would probably prove only moderately helpful. Nevertheless, it has been shown that by artificially allocating products to broader groups, it is possible to better understand customers' preferences on the basis of their shopping baskets. Such a trick can be helpful in planning store departments or shelves: very often, when walking through a supermarket, we notice that beer is placed close to snacks, which may influence the customer's decisions. What's more, managers can also think about introducing sales, or getting rid of goods remaining in stock, by creating special promotions such as: "When purchasing product Y, you will receive a 50% discount on the purchase of product X".