Overview

Market Basket Analysis is a useful tool for learning about customers. By mining association rules with the appropriate libraries, we obtain a concrete picture of our clients' preferences: which products are bought most often, and which combinations of products frequently end up in the same basket. With this knowledge we are better prepared to design promotions and offers and to understand our clients.

The dataset comes from “The Bread Basket”, a bakery located in Edinburgh. It has 20507 entries, over 9000 transactions, and 4 columns. It records the items that customers ordered online from this bakery, and the stated time period of the data is from 26-01-11 to 27-12-03. Are there any patterns, and can we discover interesting consumer behaviors? Let’s find out!

Exploratory Data Analysis

It is worth first looking at how the data is structured and visualizing it, since this can already provide significant insights.

Transactions in different day periods

As expected, people usually visit the bakery in the afternoon and in the morning. However, I had expected the morning to record a higher frequency than the afternoon, or at least not much difference between the two.

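library(ggplot2)
library(wesanderson)  # provides wes_palette()

# geom_bar() counts the rows of data in each period of the day
ggplot(data, aes(x=period_day, fill=period_day)) +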
  geom_bar(position="dodge") +
  scale_fill_manual(values=wes_palette("GrandBudapest1")) +
  xlab("Period of the day") +
  ylab("Number of transactions") +
  labs(fill="")

Transactions in different week periods

In total, weekdays record more transactions than the weekend. However, weekdays span five days and the weekend only two, so the number of transactions per day is higher at the weekend.
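
The per-day comparison can be made explicit by dividing the number of rows in each period by the number of distinct days observed in that period. The following is only a sketch on a hypothetical toy data frame (`toy`, with made-up dates), not the bakery data; only the column name `weekday_weekend` is taken from the plot code.

```r
# Hypothetical toy rows standing in for the bakery data (not real figures)
toy <- data.frame(
  date = as.Date(c("2017-01-02", "2017-01-02", "2017-01-03", "2017-01-04",
                   "2017-01-07", "2017-01-07", "2017-01-07")),
  weekday_weekend = c("weekday", "weekday", "weekday", "weekday",
                      "weekend", "weekend", "weekend")
)

# Rows recorded in each period
rows_per_period <- table(toy$weekday_weekend)

# Distinct calendar days observed in each period
days_per_period <- tapply(toy$date, toy$weekday_weekend,
                          function(d) length(unique(d)))

# Average number of rows per day, by period
per_day <- as.numeric(rows_per_period) /
  as.numeric(days_per_period[names(rows_per_period)])
names(per_day) <- names(rows_per_period)
per_day
```

In this toy example the weekday total is larger (4 rows against 3), yet the weekend has the higher per-day average, which is the same effect described above.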

ggplot(data, aes(x=weekday_weekend, fill=weekday_weekend)) +
  geom_bar(position="dodge") +
  scale_fill_manual(values=wes_palette("GrandBudapest1")) +
  xlab("Period of the week") +
  ylab("Number of transactions") +
  labs(fill="")

Transactions over time

We can see that the bakery went through a difficult period until January. Since the beginning of the year, there has been a positive trend.

library(dplyr)

# Keep only the calendar date of each record
data$date_time <- as.Date(data$date_time, format="%d-%m-%Y")

# Count rows per day; n() counts every row, so a basket with several items is counted once per item
items_count_df <- data %>%
  group_by(date_time) %>%
  summarise(freq = n())

ggplot(items_count_df, aes(x=date_time, y=freq)) +
  geom_line(color="orange") +
  xlab("Month") +
  ylab("Number of transactions")

Word cloud

A word cloud is a convenient visual representation of frequencies. We can take a quick look at it and read off the most common terms. It seems that our bakery is not unusual: its most popular products are coffee, tea, bread and sandwiches. However, it may be well known for its Medialunas (an Argentinian pastry made from butter and egg added to a rich bread-style dough).

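library(tm)            # Corpus(), VectorSource(), TermDocumentMatrix()
library(wordcloud)
library(RColorBrewer)  # brewer.pal()

# Build a term-document matrix from the item names and sum each term's occurrences
items <- data$Item
docs <- Corpus(VectorSource(items))
document_matrix <- TermDocumentMatrix(docs)
matrix <- as.matrix(document_matrix)
words <- sort(rowSums(matrix), decreasing=TRUE)
df <- data.frame(word = names(words), freq = words)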

wordcloud(words=df$word, 
          freq=df$freq, 
          min.freq=1,
          max.words= 150, 
          random.order=FALSE, 
          scale=c(8,2),
          colors=brewer.pal(9, "Paired"))
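
Because the term-document matrix works on individual words, a multi-word item name such as Hot chocolate is likely to be split into separate terms. If whole item names are preferred, one possible alternative is to count them directly with `table()`. This is only a sketch on a hypothetical toy vector (`toy_items`), not the approach used above.

```r
# Hypothetical toy item names standing in for data$Item
toy_items <- c("Coffee", "Bread", "Hot chocolate", "Coffee",
               "Tea", "Hot chocolate", "Coffee")

# Count each complete item name and sort from most to least frequent
counts <- sort(table(toy_items), decreasing = TRUE)

# Same word/freq layout as the data frame passed to wordcloud()
df_items <- data.frame(word = names(counts), freq = as.vector(counts))
df_items
```

A data frame built this way has the same `word` and `freq` columns, so it could be passed to `wordcloud()` in the same manner.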

Association rules

Association rules are useful for discovering relationships or patterns between items in large datasets. They are commonly used in market basket analysis to identify frequent itemsets and generate rules that describe the relationships between items in a transaction.

Metrics used

Support is the fraction of all transactions that contain both X and Y. A high support value indicates that the itemset occurs frequently in the dataset and is therefore considered important. \[ \text{Support}(X \Rightarrow Y) = \frac{\text{number of transactions with both } X \text{ and } Y}{\text{total number of transactions}} \]

Confidence is the fraction of transactions containing X that also contain Y. A high confidence indicates that customers who already have X in the basket often add Y as well. \[ \text{Confidence}(X \Rightarrow Y) = \frac{\text{number of transactions with both } X \text{ and } Y}{\text{number of transactions with } X} \]

Lift compares the confidence of the rule with how often Y is bought overall. A lift value greater than 1 indicates that the items are positively associated: the presence of X makes the presence of Y more likely. A lift value less than 1 indicates that the items are negatively associated, and a lift value of 1 indicates that there is no association between the items. \[ \text{Lift}(X \Rightarrow Y) = \frac{\text{Confidence}(X \Rightarrow Y)}{\text{fraction of transactions with } Y} \]
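
To make the three definitions concrete, here is a small sketch in base R that computes them by hand. The baskets are hypothetical toy data, and the helper names (`contains`, `support`, `confidence`, `lift`) are introduced only for this illustration; they are not functions from `arules`.

```r
# Hypothetical toy baskets (not the bakery data)
baskets <- list(
  c("Coffee", "Toast"),
  c("Coffee", "Cake"),
  c("Bread"),
  c("Coffee", "Bread"),
  c("Tea", "Cake")
)

# TRUE for every basket that contains all of the given items
contains <- function(items) {
  vapply(baskets, function(b) all(items %in% b), logical(1))
}

# Metrics for the rule X => Y, following the formulas above
support    <- function(x, y) mean(contains(c(x, y)))
confidence <- function(x, y) sum(contains(c(x, y))) / sum(contains(x))
lift       <- function(x, y) confidence(x, y) / mean(contains(y))

support("Toast", "Coffee")     # 1 of 5 baskets: 0.2
confidence("Toast", "Coffee")  # 1 of 1 Toast baskets: 1
lift("Toast", "Coffee")        # 1 / 0.6, about 1.67
```

Here the lift is above 1 because every basket with Toast also contains Coffee, while Coffee appears in only 60% of all baskets.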

Analysis



Examples of transactions

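library(arules)

# Basket format: one transaction per line, with the transaction ID in the first column
transactions <- read.transactions("items.csv", format="basket", sep=",", cols=1, skip=1)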
arules::inspect(head(transactions))
##     items                         transactionID
## [1] {Bread}                       1            
## [2] {Scandinavian}                2            
## [3] {Cookies, Hot chocolate, Jam} 3            
## [4] {Muffin}                      4            
## [5] {Bread, Coffee, Pastry}       5            
## [6] {Medialuna, Muffin, Pastry}   6



Inspecting rules after applying apriori algorithm
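
The call that created `basket_rules` is not shown in this section. As a rough sketch, rules of this kind are typically mined with `apriori()` from `arules`. The toy baskets and the thresholds (`support = 0.02`, `confidence = 0.2`, `minlen = 2`) are assumptions chosen for illustration, not necessarily the settings behind the output that follows.

```r
library(arules)

# Hypothetical toy baskets (not the bakery data)
toy_baskets <- list(
  c("Coffee", "Toast"),
  c("Coffee", "Cake"),
  c("Bread"),
  c("Coffee", "Bread"),
  c("Tea", "Cake"),
  c("Coffee", "Toast")
)
toy_transactions <- as(toy_baskets, "transactions")

# Mine rules with at least two items; the thresholds are assumed values
toy_rules <- apriori(
  toy_transactions,
  parameter = list(support = 0.02, confidence = 0.2, minlen = 2),
  control = list(verbose = FALSE)
)

# Show the strongest rules first
toy_rules <- sort(toy_rules, by = "lift", decreasing = TRUE)
inspect(head(toy_rules, 3))
```

With the real `transactions` object in place of `toy_transactions`, the same call would return a rule set that can be inspected and plotted in the way shown in the rest of this section.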

arules::inspect(basket_rules)
##      lhs                rhs      support    confidence coverage   lift      count
## [1]  {Toast}         => {Coffee} 0.02366614 0.7044025  0.03359746 1.4730822 224
## [2]  {Juice}         => {Coffee} 0.02060222 0.5357143  0.03845747 1.1203128 195
## [3]  {Cookies}       => {Coffee} 0.02820919 0.5194553  0.05430534 1.0863111 267
## [4]  {Medialuna}     => {Coffee} 0.03507660 0.5684932  0.06170100 1.1888616 332
## [5]  {Hot chocolate} => {Coffee} 0.02958267 0.5072464  0.05832013 1.0607793 280
## [6]  {Sandwich}      => {Coffee} 0.03824617 0.5323529  0.07184363 1.1132834 362
## [7]  {Pastry}        => {Bread}  0.02916006 0.3390663  0.08600106 1.0372537 276
## [8]  {Pastry}        => {Coffee} 0.04743793 0.5515971  0.08600106 1.1535276 449
## [9]  {Cake}          => {Tea}    0.02377179 0.2288911  0.10385631 1.6059709 225
## [10] {Cake}          => {Bread}  0.02334918 0.2248220  0.10385631 0.6877634 221
## [11] {Cake}          => {Coffee} 0.05472795 0.5269583  0.10385631 1.1020018 518
## [12] {Tea}           => {Coffee} 0.04986793 0.3498888  0.14252509 0.7317052 472
## [13] {Bread}         => {Coffee} 0.09001585 0.2753717  0.32688854 0.5758712 852
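
The columns of this output are tied together by the formulas from the metrics section, which can be checked by hand. The sketch below copies the values printed for rule [1] ({Toast} => {Coffee}) and recomputes the quantities they imply; the variable names are introduced only for this check.

```r
# Values copied from rule [1] {Toast} => {Coffee} in the output above
support    <- 0.02366614
confidence <- 0.7044025
coverage   <- 0.03359746   # share of transactions containing Toast
lift       <- 1.4730822
count      <- 224

# Confidence equals support divided by the support of the left-hand side
support / coverage          # about 0.704

# Total number of transactions implied by count and support
n_transactions <- count / support
round(n_transactions)       # 9465

# Share of transactions containing Coffee, implied by confidence and lift
coffee_support <- confidence / lift
round(coffee_support, 3)    # 0.478
```

The implied total of 9465 transactions agrees with the "over 9000 transactions" stated in the overview, and Coffee appears in just under half of them.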



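Coffee is the most popular product, and it is the right-hand side of most of the rules above. Medialuna also ranks among the frequent items (coverage of about 0.06 in rule [4]), which is consistent with the earlier suggestion that the bakery may be known for its Medialunas.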

itemFrequencyPlot(transactions, 
                  topN=13, 
                  ylim=c(0, 0.5), 
                  las=2, 
                  col="orange",
                  xlab="Item",
                  ylab="Item frequency")

library(arulesViz)

# Scatter plot of the rules: support against confidence, shaded by lift.
# width and height are dropped: they are not recognised by this plot method,
# which only printed its list of available control parameters in response.
plot(basket_rules, measure=c("support","confidence"), shading="lift")



We can see that buying coffee together with something else is very common. I would guess that the quality of the pastry is therefore important for the bakery: if it is high and the products look appealing, clients are likely to spend more.

plot(basket_rules, method="grouped")



Good coffee is the key to success. If we provide the best coffee experience in the city, people will visit us, and their baskets will contain more than just coffee.

plot(basket_rules, method="graph")



Another type of rule visualization is the parallel coordinates plot, which once again shows that good coffee is the key to success.

plot(basket_rules, method="paracoord", control=list(reorder=TRUE))



We can also generate an interactive representation of the association rules. There are 13 generated rules, each with its calculated statistics.

inspectDT(basket_rules)

Conclusion

Considering how much information we retrieved from the 4 columns of a data frame, the result is satisfying. Association rules are beneficial for companies that sell products.
We were able to detect significant patterns. Providing this kind of analysis to the stakeholders of the company can deliver impactful insights and help to adjust marketing or sales campaigns.
Additionally, the visualizations are clear and easy to interpret.