Association rules are classified as unsupervised learning methods, in which we try to find relations or patterns among large sets of data items. One such method is "Market Basket Analysis" (MBA), which is built around "if-then" rules. In other words, this method can help predict the behaviour of a shopping customer by estimating which other products they are likely to select, based on the products already in their basket. For example, we might want to calculate the chance that a customer who has already put butter into their basket will also take bread.
In order to obtain meaningful results, several metrics have been highlighted that appear to be important. These are:

- support — the share of all transactions that contain both X and Y;
- confidence — the share of transactions containing X that also contain Y (an estimate of P(Y | X));
- lift — confidence divided by the overall share of transactions containing Y; values above 1 indicate that X and Y occur together more often than would be expected if they were independent.

By and large, the rules with high values of the above statistics are the most interesting ones.
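To make these definitions concrete, here is a toy illustration with made-up numbers (they are not taken from the dataset analysed below):
n_baskets      <- 100  # hypothetical: 100 baskets in total
n_butter       <- 30   # 30 baskets contain butter
n_bread        <- 60   # 60 baskets contain bread
n_butter_bread <- 20   # 20 baskets contain both butter and bread

support    <- n_butter_bread / n_baskets          # P(butter and bread)   = 0.20
confidence <- n_butter_bread / n_butter           # P(bread | butter)    ~= 0.67
lift       <- confidence / (n_bread / n_baskets)  # confidence / P(bread) ~= 1.11
round(c(support = support, confidence = confidence, lift = lift), 2)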
# Data handling
library(tidyverse)
library(kableExtra)
#library(plyr) - ddply() function for making the baskets and mapvalues() function for recoding the data
# Association rules
library(arules)
library(arulesViz)
grocery_org <- read.csv('Groceries_dataset.csv', sep = ",")
dim(grocery_org)
## [1] 38765 3
The dataset for this project comes from https://www.kaggle.com/heeraldedhia/groceries-dataset. Its dimensions are 38765 x 3. The variables are as follows:
kable(head(grocery_org)) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| Member_number | Date | itemDescription |
|---|---|---|
| 1808 | 21-07-2015 | tropical fruit |
| 2552 | 05-01-2015 | whole milk |
| 2300 | 19-09-2015 | pip fruit |
| 1187 | 12-12-2015 | other vegetables |
| 3037 | 01-02-2015 | whole milk |
| 4941 | 14-02-2015 | rolls/buns |
Despite the small number of variables, it is definitely worth delving into the dataset to look for some interesting dependencies.
summary(grocery_org)
## Member_number Date itemDescription
## Min. :1000 Length:38765 Length:38765
## 1st Qu.:2002 Class :character Class :character
## Median :3005 Mode :character Mode :character
## Mean :3004
## 3rd Qu.:4007
## Max. :5000
We see that summarising the data does not, at first glance, give much valuable information. Nevertheless, it is worth highlighting that the variable "Date" is stored as a character variable. It is advisable to change that.
# I don't want to affect my original data, so just in case I will make a copy
grocery <- as.data.frame(grocery_org)
grocery$Date <- as.Date(grocery$Date, "%d-%m-%y")
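# NOTE: "%d-%m-%y" expects a two-digit year, so a date such as "21-07-2015" is parsed
# as 2020-07-21 (the leading "20" is read as the year). If the original years (e.g. 2015)
# were needed, "%d-%m-%Y" might be the more appropriate format.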
class(grocery$Date) # Check
## [1] "Date"
Now, in order to better understand the data, I will split the Date variable into three new ones: Year, Month and Day.
grocery_final <- separate(grocery, "Date", c("Year", "Month", "Day"), sep = "-")
grocery_final$Date <- grocery$Date
kable(head(grocery_final)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| Member_number | Year | Month | Day | itemDescription | Date |
|---|---|---|---|---|---|
| 1808 | 2020 | 07 | 21 | tropical fruit | 2020-07-21 |
| 2552 | 2020 | 01 | 05 | whole milk | 2020-01-05 |
| 2300 | 2020 | 09 | 19 | pip fruit | 2020-09-19 |
| 1187 | 2020 | 12 | 12 | other vegetables | 2020-12-12 |
| 3037 | 2020 | 02 | 01 | whole milk | 2020-02-01 |
| 4941 | 2020 | 02 | 14 | rolls/buns | 2020-02-14 |
# By the way, let's see how many different dates I have in my analysis
print(paste("The number of unique years is equal to:", length(unique(grocery_final$Year))))
## [1] "The number of unique years is equal to: 1"
print(paste("The number of unique months is equal to:", length(unique(grocery_final$Month))))
## [1] "The number of unique months is equal to: 12"
print(paste("The number of unique days is equal to:", length(unique(grocery_final$Day))))
## [1] "The number of unique days is equal to: 31"
Let's look for NA values and for potential duplicates. We should be careful, though, because it may happen that one client bought two or more pieces of the same product on the same day.
# Look for NaN's
colSums(is.na(grocery_final))
## Member_number Year Month Day itemDescription
## 0 0 0 0 0
## Date
## 0
sum(duplicated(grocery_final))
## [1] 762
kable(head(grocery_final[duplicated(grocery_final), ])) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| | Member_number | Year | Month | Day | itemDescription | Date |
|---|---|---|---|---|---|---|
| 5016 | 2051 | 2020 | 09 | 11 | frankfurter | 2020-09-11 |
| 5023 | 3055 | 2020 | 08 | 18 | other vegetables | 2020-08-18 |
| 5045 | 1994 | 2020 | 03 | 11 | whole milk | 2020-03-11 |
| 5056 | 1682 | 2020 | 06 | 25 | pip fruit | 2020-06-25 |
| 5060 | 4324 | 2020 | 01 | 05 | sausage | 2020-01-05 |
| 5066 | 2694 | 2020 | 11 | 24 | whole milk | 2020-11-24 |
# The data may contain genuine duplicates. To check this, let's look at one client
kable(grocery_final[grocery_final$Member_number == 2051 & grocery_final$itemDescription == "other vegetables", ]) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| | Member_number | Year | Month | Day | itemDescription | Date |
|---|---|---|---|---|---|---|
| 2222 | 2051 | 2020 | 07 | 23 | other vegetables | 2020-07-23 |
| 18371 | 2051 | 2020 | 07 | 23 | other vegetables | 2020-07-23 |
| 24376 | 2051 | 2020 | 04 | 20 | other vegetables | 2020-04-20 |
| 29435 | 2051 | 2020 | 12 | 03 | other vegetables | 2020-12-03 |
As we can see, there are some duplicates. However, we shouldn't be worried, because we don't know exactly which vegetables these clients bought. That's why I decided not to remove any records (the MBA algorithm will later remove the duplicates on its own). Keeping in mind that there are no missing values, let's check which customer bought the most products.
# Checking the uniqueness of customers (one customer is like a one basket)
length(unique(grocery_final$Member_number))
## [1] 3898
# Because I have quite a lot of shopping carts, I will limit the overview to the 6 customers with the most purchases
head(sort(table(grocery_final$Member_number), decreasing = TRUE))
##
## 3180 2051 3050 3737 2271 2433
## 36 33 33 33 31 31
We can see that the record holders bought more than 30 products during the year. For these 6 customers, I decided to check how their purchases were distributed over time (by month). This will help me assess whether they were regular customers or one-off buyers.
# Note: facet_wrap() below arranges the six panels, so no par(mfrow = c(2, 3)) call is needed for ggplot output
grocery_final%>%
filter(Member_number %in% c(3180, 2051, 3050, 3737, 2271, 2433)) %>%
group_by(Member_number, Month) %>%
mutate(Occurencies_per_month = n()) %>%
select(Member_number, Month, Occurencies_per_month) %>%
ggplot(mapping = aes(x = Month,
y = Occurencies_per_month,
group = Member_number,
colour = factor(Member_number))) +
geom_line(show.legend = FALSE) +
ggtitle("Monthly distribution of the products bought") +
xlab("Months") +
ylab("Amount of bought products") +
theme_bw() +
facet_wrap(~Member_number, scales = "free",
labeller = as_labeller(c("2051" = "Customer ID: 2051",
"2271" = "Customer ID: 2271",
"2433" = "Customer ID: 2433",
"3050" = "Customer ID: 3050",
"3180" = "Customer ID: 3180",
"3737" = "Customer ID: 3737")))
As we can see in the charts above, none of the top clients are one-time customers, and their transactions are distributed unevenly throughout the year. One might also be tempted to conclude that the customers' purchasing decisions were independent of each other. Moreover, we cannot distinguish any common pattern, e.g. all customers making more purchases around Christmas (December).
NOTE
I am aware that my data does not come from a single point in time. However, for the purpose of practising the algorithm, I will assume that all purchases were made at the same time. So now, for each customer, I will calculate the total number of purchased products.
grocery_final <-
grocery_final%>%
group_by(Member_number) %>%
mutate(Occurencies = n())
basket_size_plot <- ggplot(data = grocery_final, mapping = aes(x = Occurencies, y = 1)) +
geom_violin(fill='#A4A4A4') +
ggtitle("Violin plot for the customers baskets") +
xlab("The size of the basket") +
ylab("")
basket_size_plot + geom_boxplot(width=0.1)
Now, having looked at the basket sizes, let me assess which products were bought most often.
products_countplot <- grocery_final %>%
group_by(itemDescription) %>%
summarise(counts = n()) %>%
top_n(10) %>%
ggplot(aes(x = reorder(itemDescription, -counts), y = counts)) +
geom_bar(fill = "mediumpurple3", stat = "identity") +
ggtitle("Countplot of ten the most popular products") +
xlab("Products") +
ylab("") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
products_countplot
Whole milk is the most frequently selected product. From the graph above, it can also be seen that vegetables are split across several product groups (e.g. "other vegetables", "root vegetables"), which may significantly affect further results.
Interesting fact
The arules package also provides an auxiliary plotting function for transaction data, itemFrequencyPlot(), which shows the most frequently purchased/selected items from the shopping baskets. The resulting barplot/countplot should be equivalent to what I have presented above.
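A minimal sketch of such a call (assuming the basket_df transactions object that is created later with read.transactions()):
# Plot the 10 most frequent items directly from the transactions object
itemFrequencyPlot(basket_df, topN = 10, type = "absolute")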
After analysing the data, let's transform it into a form that will enable me to apply the Association rules algorithm.
Now, I will sort the data by customer ID (the Member_number variable) and make sure the ID is stored in a numeric format.
converted_grocery <- grocery_final[order(grocery_final$Member_number),]
converted_grocery$Member_number <- as.numeric(converted_grocery$Member_number)
glimpse(converted_grocery)
## Rows: 38,765
## Columns: 7
## Groups: Member_number [3,898]
## $ Member_number <dbl> 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, ~
## $ Year <chr> "2020", "2020", "2020", "2020", "2020", "2020", "2020"~
## $ Month <chr> "05", "07", "03", "11", "03", "06", "05", "07", "03", ~
## $ Day <chr> "27", "24", "15", "25", "15", "24", "27", "24", "15", ~
## $ itemDescription <chr> "soda", "canned beer", "sausage", "sausage", "whole mi~
## $ Date <date> 2020-05-27, 2020-07-24, 2020-03-15, 2020-11-25, 2020-~
## $ Occurencies <int> 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 12~
After that, I will group all the items that were bought by each customer on the same date, in order to make some baskets.
items <- plyr::ddply(converted_grocery, c("Member_number","Date"), function(temp_df)paste(temp_df$itemDescription,collapse = ","))
items <- subset(items, select = V1)
head(items)
## V1
## 1 sausage,whole milk,semi-finished bread,yogurt
## 2 soda,pickled vegetables
## 3 whole milk,pastry,salty snack
## 4 canned beer,misc. beverages
## 5 sausage,hygiene articles
## 6 frankfurter,soda,whipped/sour cream
Finally, I save the new data to a CSV file and then read it back with the appropriate function, arules::read.transactions().
write.csv(items, file = "transactions.csv", quote = FALSE, row.names = T)
basket_df <- arules::read.transactions("transactions.csv", format = "basket", sep = ",", cols = 1)
## Warning in asMethod(object): removing duplicated items in transactions
print(basket_df)
## transactions in sparse format with
## 14935 transactions (rows) and
## 168 items (columns)
At this point, I deliberately did not turn off warning messages, because when the baskets are created we get an important message telling us that the function removes duplicated items on its own. This is consistent with what I mentioned earlier in the exploratory part.
With the data prepared and cleaned, I will now proceed to the MBA method itself. I will start with the apriori() algorithm, which seems to be the most helpful here. In general, we are interested in rules of the form X => Y, for which we can calculate the three measures mentioned earlier: support, confidence and lift. The support and confidence values passed to apriori() act as minimum thresholds for the rules that are kept.
basket_apriori <- apriori(basket_df, parameter = list(sup = 0.001, conf = 0.05, minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.05 0.1 1 none FALSE TRUE 5 0.001 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 14
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[168 item(s), 14935 transaction(s)] done [0.00s].
## sorting and recoding items ... [149 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [452 rule(s)] done [0.19s].
## creating S4 object ... done [0.00s].
summary(basket_apriori)
## set of 452 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 425 27
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 2.00 2.00 2.06 2.00 3.00
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001004 Min. :0.05000 Min. :0.005357 Min. :0.5185
## 1st Qu.:0.001272 1st Qu.:0.06394 1st Qu.:0.016003 1st Qu.:0.7662
## Median :0.001942 Median :0.08102 Median :0.023636 Median :0.8337
## Mean :0.002768 Mean :0.08775 Mean :0.033714 Mean :0.8873
## 3rd Qu.:0.003365 3rd Qu.:0.10492 3rd Qu.:0.043790 3rd Qu.:0.9621
## Max. :0.014931 Max. :0.25581 Max. :0.158219 Max. :2.1659
## count
## Min. : 15.00
## 1st Qu.: 19.00
## Median : 29.00
## Mean : 41.34
## 3rd Qu.: 50.25
## Max. :223.00
##
## mining info:
## data ntransactions support confidence
## basket_df 14935 0.001 0.05
## call
## apriori(data = basket_df, parameter = list(sup = 0.001, conf = 0.05, minlen = 2))
inspect(basket_apriori[1:5])
## lhs rhs support confidence
## [1] {frozen fish} => {whole milk} 0.001071309 0.1568627
## [2] {seasonal products} => {rolls/buns} 0.001004352 0.1415094
## [3] {pot plants} => {other vegetables} 0.001004352 0.1282051
## [4] {pot plants} => {whole milk} 0.001004352 0.1282051
## [5] {pasta} => {whole milk} 0.001071309 0.1322314
## coverage lift count
## [1] 0.006829595 0.9914283 16
## [2] 0.007097422 1.2839875 15
## [3] 0.007833947 1.0480260 15
## [4] 0.007833947 0.8103020 15
## [5] 0.008101774 0.8357495 16
inspect(head(sort(basket_apriori, by = "confidence", decreasing = TRUE)))
## lhs rhs support confidence coverage
## [1] {sausage, yogurt} => {whole milk} 0.001473050 0.2558140 0.005758286
## [2] {rolls/buns, sausage} => {whole milk} 0.001138266 0.2125000 0.005356545
## [3] {sausage, soda} => {whole milk} 0.001071309 0.1797753 0.005959156
## [4] {semi-finished bread} => {whole milk} 0.001673920 0.1760563 0.009507867
## [5] {rolls/buns, yogurt} => {whole milk} 0.001339136 0.1694915 0.007900904
## [6] {sausage, whole milk} => {yogurt} 0.001473050 0.1641791 0.008972213
## lift count
## [1] 1.616835 22
## [2] 1.343076 17
## [3] 1.136244 16
## [4] 1.112739 25
## [5] 1.071247 20
## [6] 1.908183 22
Based on the table above (the confidence column), we see that roughly one in four customers who bought sausage and yogurt also bought whole milk. What's more, it turns out that whole milk is the consequent of the five most confident rules.
I will now present the obtained results graphically in several different views. Despite the variety of methods, the information contained in the graphs should tell the same story.
plot(basket_apriori[1:20], method="matrix", measure="lift")
## Itemsets in Antecedent (LHS)
## [1] "{packaged fruit/vegetables}" "{seasonal products}"
## [3] "{herbs}" "{processed cheese}"
## [5] "{flour}" "{detergent}"
## [7] "{red/blush wine}" "{frozen fish}"
## [9] "{semi-finished bread}" "{pot plants}"
## [11] "{pasta}" "{pickled vegetables}"
## Itemsets in Consequent (RHS)
## [1] "{whole milk}" "{other vegetables}" "{rolls/buns}"
## [4] "{yogurt}" "{root vegetables}" "{tropical fruit}"
plot(basket_apriori[1:20], method="grouped")
plot(basket_apriori[1:20], method="graph")
plot(basket_apriori[1:20], method="paracoord", control=list(reorder=TRUE))
Besides the basic measures such as support, confidence and lift, which are recognised as the hallmarks of Association Rules, many more can be included in the analysis. One of them is certainly the Jaccard index.
Without going into the mathematical formula, this index measures how often two items (in my case, products) are bought together, relative to how often at least one of them is bought. The dissimilarity() function returns the Jaccard distance, i.e. one minus the index, as a matrix between the products. The closer a value is to 1, the lower the chance that the two products appear in the same basket.
basket_index <-basket_df[, itemFrequency(basket_df)>0.05]
jac_index<-dissimilarity(basket_index, which="items", method = "jaccard")
round(jac_index, 2)
## bottled water citrus fruit other vegetables pastry rolls/buns
## citrus fruit 0.98
## other vegetables 0.97 0.97
## pastry 0.97 0.98 0.98
## rolls/buns 0.97 0.97 0.95 0.98
## root vegetables 0.97 0.98 0.97 0.98 0.97
## sausage 0.97 0.99 0.97 0.97 0.97
## soda 0.97 0.97 0.95 0.97 0.96
## tropical fruit 0.97 0.98 0.97 0.98 0.96
## whole milk 0.97 0.96 0.94 0.97 0.94
## yogurt 0.97 0.96 0.96 0.97 0.96
## root vegetables sausage soda tropical fruit whole milk
## citrus fruit
## other vegetables
## pastry
## rolls/buns
## root vegetables
## sausage 0.97
## soda 0.97 0.96
## tropical fruit 0.97 0.98 0.97
## whole milk 0.97 0.96 0.95 0.96
## yogurt 0.97 0.96 0.97 0.96 0.95
As we can see, all values in this matrix are above 0.94, i.e. the Jaccard distances are very high. We can therefore conclude that there is only a small chance of observing even these popular products together in the same basket.
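As a sanity check, a single entry of this matrix could be reproduced by hand. A minimal sketch (the coercion to a plain matrix and the choice of the "whole milk" / "rolls/buns" pair are my own illustration, not part of the original workflow):
# Convert the transactions to a logical item matrix (rows = baskets, columns = items)
item_mat <- as(basket_df, "matrix")
a <- item_mat[, "whole milk"]
b <- item_mat[, "rolls/buns"]
jaccard_similarity <- sum(a & b) / sum(a | b)  # |A intersect B| / |A union B|
1 - jaccard_similarity                         # Jaccard distance, as reported by dissimilarity()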
Since I already have a dissimilarity matrix, I will also plot a dendrogram based on it.
plot(hclust(jac_index, method="ward.D2"), main="Dendrogram for trans")
From the analysis above, I know that whole milk turned out to be the most frequently selected product. Now, I will check what else customers had in their baskets when they picked whole milk, i.e. I will restrict the right-hand side of the rules to whole milk.
whole_milk_apriori <- apriori(basket_df, parameter = list(supp=0.001, conf=0.15),appearance = list(default="lhs",rhs="whole milk"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.15 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 14
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[168 item(s), 14935 transaction(s)] done [0.01s].
## sorting and recoding items ... [149 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [10 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(whole_milk_apriori, linebreak = FALSE)
## lhs rhs support confidence coverage
## [1] {} => {whole milk} 0.158218949 0.1582189 1.000000000
## [2] {frozen fish} => {whole milk} 0.001071309 0.1568627 0.006829595
## [3] {detergent} => {whole milk} 0.001406093 0.1627907 0.008637429
## [4] {semi-finished bread} => {whole milk} 0.001673920 0.1760563 0.009507867
## [5] {ham} => {whole milk} 0.002745229 0.1601562 0.017140944
## [6] {bottled beer} => {whole milk} 0.007164379 0.1578171 0.045396719
## [7] {sausage, yogurt} => {whole milk} 0.001473050 0.2558140 0.005758286
## [8] {sausage, soda} => {whole milk} 0.001071309 0.1797753 0.005959156
## [9] {rolls/buns, sausage} => {whole milk} 0.001138266 0.2125000 0.005356545
## [10] {rolls/buns, yogurt} => {whole milk} 0.001339136 0.1694915 0.007900904
## lift count
## [1] 1.0000000 2363
## [2] 0.9914283 16
## [3] 1.0288951 21
## [4] 1.1127386 25
## [5] 1.0122444 41
## [6] 0.9974602 107
## [7] 1.6168351 22
## [8] 1.1362437 16
## [9] 1.3430755 17
## [10] 1.0712467 20
I deliberately increased the confidence threshold in order to focus only on those products which might be crucial to the analysis. The first rule, with an empty left-hand side, may look strange at first, but it simply reflects the fact that whole milk appears in about 15.8% of all baskets (the minlen parameter was left at its default of 1 here). Nevertheless, I think that whole milk was usually chosen quite randomly and not, e.g., as part of a given recipe for some cake. So I will check how the situation looks for another type of dairy product, yogurt. If the results turn out to be similar, I may suspect that the customers of my "store" were very diverse in their habits.
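If one wanted to drop that trivial empty-LHS rule, a minimal sketch using standard arules indexing (the object name non_trivial_rules is mine):
# Keep only rules whose left-hand side contains at least one item
non_trivial_rules <- whole_milk_apriori[size(lhs(whole_milk_apriori)) > 0]
inspect(head(sort(non_trivial_rules, by = "confidence")))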
yogurt_apriori <- apriori(basket_df, parameter = list(supp=0.001, conf=0.10), appearance = list(default="lhs",rhs="yogurt"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.1 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 14
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[168 item(s), 14935 transaction(s)] done [0.00s].
## sorting and recoding items ... [149 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [5 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(yogurt_apriori, linebreak = FALSE)
## lhs rhs support confidence coverage
## [1] {detergent} => {yogurt} 0.001071309 0.1240310 0.008637429
## [2] {herbs} => {yogurt} 0.001138266 0.1075949 0.010579176
## [3] {soft cheese} => {yogurt} 0.001272179 0.1266667 0.010043522
## [4] {chewing gum} => {yogurt} 0.001406093 0.1166667 0.012052226
## [5] {sausage, whole milk} => {yogurt} 0.001473050 0.1641791 0.008972213
## lift count
## [1] 1.441559 16
## [2] 1.250529 17
## [3] 1.472192 19
## [4] 1.355966 21
## [5] 1.908183 22
Unfortunately, yogurt was found in far fewer baskets than milk, so I decided to lower the confidence threshold to 0.10. It turns out that for this product too, it is difficult to rationally explain what guided customers who combined, e.g., yogurt with detergent. What is more, given the low confidence values for all five rules above, it is hard to find any strong relationship between the products.
To be honest, I am not quite happy with the results obtained so far, so I will use a small trick and recode some of the products by changing their names to more generic equivalents. Similar products will be grouped into one category, which should increase the confidence values.
# Old names
names.real <- c("whole milk", "cream cheese ", "yogurt", "root vegetables", "frozen vegetables", "other vegetables", "ham", "beef", "sausage")
# New names
names.new <-c("diary", "diary", "diary", "vegetables", "vegetables", "vegetables", "meat", "meat", "meat")
# Recode the products
data_fe <- as.data.frame(converted_grocery)
data_fe$itemDescription <- plyr::mapvalues(data_fe$itemDescription, names.real, names.new)
# Check
head(data_fe[data_fe$itemDescription %in% names.new, ])
## Member_number Year Month Day itemDescription Date Occurencies
## 3 1000 2020 03 15 meat 2020-03-15 13
## 4 1000 2020 11 25 meat 2020-11-25 13
## 5 1000 2020 03 15 diary 2020-03-15 13
## 6 1000 2020 06 24 diary 2020-06-24 13
## 11 1000 2020 03 15 diary 2020-03-15 13
## 16 1001 2020 04 14 meat 2020-04-14 12
As before, I will create some baskets.
items_fe <- plyr::ddply(data_fe, c("Member_number","Date"), function(temp_df)paste(temp_df$itemDescription,collapse = ","))
items_fe <- subset(items_fe, select = V1)
write.csv(items_fe, file = "transactions_fe.csv", quote = FALSE, row.names = T)
basket_df_fe <- arules::read.transactions("transactions_fe.csv", format = "basket", sep = ",", cols = 1)
print(basket_df_fe) # The number of rows hasn't changed, but the number of columns has decreased by 7.
## transactions in sparse format with
## 14935 transactions (rows) and
## 161 items (columns)
basket_apriori_fe <- apriori(basket_df_fe, parameter = list(sup = 0.001, conf = 0.05, minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.05 0.1 1 none FALSE TRUE 5 0.001 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 14
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[161 item(s), 14935 transaction(s)] done [0.00s].
## sorting and recoding items ... [142 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [482 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(basket_apriori_fe)
## set of 482 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 374 108
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 2.000 2.224 2.000 3.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001004 Min. :0.05000 Min. :0.002946 Min. :0.4994
## 1st Qu.:0.001272 1st Qu.:0.08146 1st Qu.:0.010194 1st Qu.:0.7409
## Median :0.001875 Median :0.12320 Median :0.017944 Median :0.8173
## Mean :0.003317 Mean :0.13410 Mean :0.028575 Mean :0.8583
## 3rd Qu.:0.003415 3rd Qu.:0.18154 3rd Qu.:0.034985 3rd Qu.:0.9306
## Max. :0.040911 Max. :0.34091 Max. :0.252293 Max. :1.7095
## count
## Min. : 15.00
## 1st Qu.: 19.00
## Median : 28.00
## Mean : 49.54
## 3rd Qu.: 51.00
## Max. :611.00
##
## mining info:
## data ntransactions support confidence
## basket_df_fe 14935 0.001 0.05
## call
## apriori(data = basket_df_fe, parameter = list(sup = 0.001, conf = 0.05, minlen = 2))
inspect(head(sort(basket_apriori_fe, by = "confidence", decreasing = TRUE)))
## lhs rhs support confidence coverage
## [1] {frankfurter, meat} => {vegetables} 0.001004352 0.3409091 0.002946100
## [2] {meat, pork} => {diary} 0.001004352 0.3191489 0.003146970
## [3] {finished products} => {diary} 0.001205223 0.2812500 0.004285236
## [4] {citrus fruit, meat} => {diary} 0.001205223 0.2769231 0.004352193
## [5] {meat, shopping bags} => {diary} 0.001138266 0.2741935 0.004151322
## [6] {meat, pastry} => {diary} 0.001540007 0.2738095 0.005624372
## lift count
## [1] 1.625631 15
## [2] 1.264992 15
## [3] 1.114774 18
## [4] 1.097624 18
## [5] 1.086805 17
## [6] 1.085283 23
Analysing the above results, we can see that I managed to artificially increase the confidence values, by close to 10 percentage points for the strongest rule, just by renaming 9 products. Compared to the first run of the algorithm, the top rules have also changed, and in many of them meat products appear for the first time. Therefore, I think that extending the meat group with further meat products (for example frankfurter and pork, which still appear as separate items in the rules above) would improve the results even more.
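A minimal sketch of that extension (the extra items are only examples taken from the rules above; the baskets would then be rebuilt exactly as before):
# Extend the recoding with further meat items that still appear on their own
names.real.ext <- c(names.real, "frankfurter", "pork")
names.new.ext  <- c(names.new,  "meat",        "meat")
data_fe_ext <- as.data.frame(converted_grocery)
data_fe_ext$itemDescription <- plyr::mapvalues(data_fe_ext$itemDescription, names.real.ext, names.new.ext)
# ...then rebuild the baskets with plyr::ddply() and read.transactions() as above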
In the analysis above, the Association rules algorithm was used to determine which product customers would be inclined to choose next, based on their current basket of goods. Several conclusions can be drawn from it. The main one is that the algorithm can tell us relatively little when there are very many possible items (in my case, food products) and purchases are spread thinly across them.
Therefore, it seems that for a supermarket manager this algorithm would probably prove only moderately helpful. Nevertheless, it has been shown that by artificially allocating products to broader groups, it is possible to better understand customers' preferences on the basis of their shopping baskets. Such a trick can be helpful in planning store departments or shelves: very often, when walking through a supermarket, we notice that beer is placed close to snacks, which may influence the customer's decisions. What's more, managers can also think about introducing sales, or getting rid of goods remaining in stock, by creating special promotions such as: "When purchasing product Y, you will receive a 50% discount on the purchase of product X".