Market Basket Analysis is a valuable tool for learning about customers. By applying association rules with the appropriate libraries, we obtain a concrete picture of our clients' preferences. This lets us see which products customers favor and whether certain combinations of products are frequently bought together. With these insights, we are better prepared to design promotions and offers and to understand our clients.
The dataset belongs to “The Bread Basket”, a bakery located in Edinburgh. It has 20507 entries, over 9000 transactions, and 4 columns. It records customers who ordered different items from this bakery online, and the time period of the data is from 26-01-11 to 27-12-03. Are there any patterns, or can we discover interesting consumer behaviors? Let’s find out!
It is worth first looking at how the data is structured and visualizing it. This alone can provide some significant insights.
As expected, people usually visit the bakery in the afternoon and in the morning. However, I expected the morning to record a higher frequency than the afternoon, or at least that there would not be much difference between the two.
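library(ggplot2)      # ggplot, geom_bar
library(wesanderson)  # wes_palette
# Bar chart of the number of rows per period of the day
ggplot(data, aes(x=period_day, fill=period_day)) +
  geom_bar(position="dodge") +
  scale_fill_manual(values=wes_palette("GrandBudapest1")) +
  xlab("Period of the day") +
  ylab("Number of transactions") +
  labs(fill="")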
Weekdays record more transactions than weekends in total. However, once we take into account that there are five weekdays and only two weekend days, the number of transactions per day is higher at weekends.
ggplot(data, aes(x=weekday_weekend, fill=weekday_weekend)) +
  geom_bar(position="dodge") +
  scale_fill_manual(values=wes_palette("GrandBudapest1")) +
  xlab("Period of the week") +
  ylab("Number of transactions") +
  labs(fill="")
We can see that the bakery went through a difficult period until January. Since the beginning of the year, we can observe a positive trend.
library(dplyr)  # %>%, group_by, summarise, n
# Keep only the date part of the timestamp
data$date_time <- as.Date(data$date_time, format="%d-%m-%Y")
# n() counts the rows (one per item sold) recorded on each date
items_count <- data %>%
  group_by(date_time) %>%
  summarise(number_transactions = n())
ggplot(items_count, aes(x=date_time, y=number_transactions)) +
  geom_line(color="orange") +
  xlab("Month") +
  ylab("Number of transactions")
A word cloud is a nice visual representation of frequencies. We can take a quick look at it and read off the most common items. It seems that our bakery is not unusual: its most popular products are coffee, tea, bread and sandwiches. However, it may be well known for its Medialunas (an Argentinian pastry dessert made from butter and egg added to a rich bread-style dough).
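library(tm)            # Corpus, VectorSource, TermDocumentMatrix
library(wordcloud)     # wordcloud
library(RColorBrewer)  # brewer.pal
# Build a term-frequency table from the item names
items <- data$Item
docs <- Corpus(VectorSource(items))
document_matrix <- TermDocumentMatrix(docs)
matrix <- as.matrix(document_matrix)
words <- sort(rowSums(matrix), decreasing=TRUE)
df <- data.frame(word=names(words), freq=words)
# Draw the cloud, most frequent terms in the centre
wordcloud(words=df$word,
          freq=df$freq,
          min.freq=1,
          max.words=150,
          random.order=FALSE,
          scale=c(8,2),
          colors=brewer.pal(9, "Paired"))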
Association rules are useful for discovering relationships or patterns between items in large datasets. They are commonly used in market basket analysis to identify frequent itemsets and generate rules that describe the relationships between items in a transaction.
For a rule X => Y, a high support value indicates that the itemset containing both X and Y occurs frequently in the dataset and is therefore considered important. \[ \text{Support} = \frac{\text{Number of transactions with both } X \text{ and } Y}{\text{Total number of transactions}} \]
A high confidence indicates that customers who already have X in the basket often add item Y as well. \[ \text{Confidence} = \frac{\text{Number of transactions with both } X \text{ and } Y}{\text{Total number of transactions with } X} \]
A lift value greater than 1 indicates that the items are positively associated: the presence of one item makes the presence of the other more likely. A lift value less than 1 indicates that the items are negatively associated, and a lift value of 1 indicates that there is no association between the items. The numerator below is the confidence of the rule. \[ \text{Lift} = \frac{\text{Number of transactions with both } X \text{ and } Y \,/\, \text{Number of transactions with } X}{\text{Fraction of transactions with } Y} \]
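To make the three definitions concrete, here is a minimal sketch in base R on five made-up baskets. The data and the helper `has_items` are hypothetical and are not taken from the bakery dataset. The sketch computes the metrics for the rule {Toast} => {Coffee} directly from the formulas above.

```r
# Hypothetical toy baskets, not the bakery data
baskets <- list(
  c("Coffee", "Toast"),
  c("Coffee", "Bread"),
  c("Bread"),
  c("Coffee", "Toast", "Cake"),
  c("Tea", "Cake")
)

# TRUE for every basket that contains all of the given items
has_items <- function(items) {
  vapply(baskets, function(b) all(items %in% b), logical(1))
}

n_total <- length(baskets)
n_both  <- sum(has_items(c("Toast", "Coffee")))  # baskets with X and Y
n_x     <- sum(has_items("Toast"))               # baskets with X
n_y     <- sum(has_items("Coffee"))              # baskets with Y

support_xy    <- n_both / n_total                # 2 / 5
confidence_xy <- n_both / n_x                    # 2 / 2
lift_xy       <- confidence_xy / (n_y / n_total) # 1 / (3 / 5)

print(c(support = support_xy, confidence = confidence_xy, lift = lift_xy))
```

In this toy example every basket with Toast also contains Coffee, so the confidence is 1, and the lift of about 1.67 says that Toast buyers take Coffee more often than the average customer does.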
Examples of transactions
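library(arules)  # read.transactions, inspect
# Column 1 holds the transaction ID; skip the header row
transactions <- read.transactions("items.csv", format="basket", sep=",", cols=1, skip=1)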
arules::inspect(head(transactions))
##     items                         transactionID
## [1] {Bread}                       1
## [2] {Scandinavian}                2
## [3] {Cookies, Hot chocolate, Jam} 3
## [4] {Muffin}                      4
## [5] {Bread, Coffee, Pastry}       5
## [6] {Medialuna, Muffin, Pastry}   6
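The call that creates `basket_rules` is not shown in this write-up. As a rough sketch of how rules of this kind can be mined, the block below runs `apriori` from the `arules` package on five made-up baskets. The toy data, the names `toy_transactions` and `toy_rules`, and the thresholds are all assumptions for illustration; they are not the values used for the bakery data.

```r
library(arules)

# Hypothetical toy baskets, not the bakery data
baskets <- list(
  c("Coffee", "Toast"),
  c("Coffee", "Bread"),
  c("Bread"),
  c("Coffee", "Toast", "Cake"),
  c("Tea", "Cake")
)
toy_transactions <- as(baskets, "transactions")

# Illustrative thresholds (assumed): keep itemsets found in at least 30% of
# baskets and rules with confidence of at least 0.8; minlen = 2 drops rules
# with an empty left-hand side
toy_rules <- apriori(toy_transactions,
                     parameter = list(support = 0.3, confidence = 0.8, minlen = 2),
                     control = list(verbose = FALSE))

inspect(toy_rules)
```

On this toy data only {Toast} => {Coffee} passes both thresholds. Lowering `support` or `confidence` yields more rules, so in practice the thresholds are tuned until the rule set is small enough to read.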
Inspecting rules after applying apriori algorithm
arules::inspect(basket_rules)
##      lhs                rhs      support    confidence coverage   lift      count
## [1]  {Toast}         => {Coffee} 0.02366614 0.7044025  0.03359746 1.4730822 224
## [2]  {Juice}         => {Coffee} 0.02060222 0.5357143  0.03845747 1.1203128 195
## [3]  {Cookies}       => {Coffee} 0.02820919 0.5194553  0.05430534 1.0863111 267
## [4]  {Medialuna}     => {Coffee} 0.03507660 0.5684932  0.06170100 1.1888616 332
## [5]  {Hot chocolate} => {Coffee} 0.02958267 0.5072464  0.05832013 1.0607793 280
## [6]  {Sandwich}      => {Coffee} 0.03824617 0.5323529  0.07184363 1.1132834 362
## [7]  {Pastry}        => {Bread}  0.02916006 0.3390663  0.08600106 1.0372537 276
## [8]  {Pastry}        => {Coffee} 0.04743793 0.5515971  0.08600106 1.1535276 449
## [9]  {Cake}          => {Tea}    0.02377179 0.2288911  0.10385631 1.6059709 225
## [10] {Cake}          => {Bread}  0.02334918 0.2248220  0.10385631 0.6877634 221
## [11] {Cake}          => {Coffee} 0.05472795 0.5269583  0.10385631 1.1020018 518
## [12] {Tea}           => {Coffee} 0.04986793 0.3498888  0.14252509 0.7317052 472
## [13] {Bread}         => {Coffee} 0.09001585 0.2753717  0.32688854 0.5758712 852
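The columns of this output are linked by the formulas given earlier, which can be checked with a little arithmetic. The sketch below uses the values printed for rule [1], {Toast} => {Coffee}; the variable names are only illustrative.

```r
# Values copied from rule [1], {Toast} => {Coffee}
support    <- 0.02366614
confidence <- 0.7044025
coverage   <- 0.03359746  # support of the left-hand side, {Toast}
lift       <- 1.4730822
count      <- 224

# Confidence is the rule's support divided by the support of the left-hand side
implied_confidence <- support / coverage

# Lift is confidence divided by the support of the right-hand side,
# so the share of baskets containing Coffee can be recovered from the table
implied_support_coffee <- confidence / lift

# Count is support times the total number of transactions
implied_n_transactions <- round(count / support)

print(c(implied_confidence, implied_support_coffee, implied_n_transactions))
```

The recomputed confidence matches the printed 0.704, Coffee turns out to be in roughly 47.8% of all baskets, and the implied total of 9465 transactions agrees with the "over 9000 transactions" stated for the dataset.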
Coffee is the most popular product. As mentioned before, the
bakery also appears to specialize in Medialunas.
itemFrequencyPlot(transactions,
topN=13,
ylim=c(0, 0.5),
las=2,
col="orange",
xlab="Item",
ylab="Item frequency")
library(arulesViz)  # plot method for rules, inspectDT
# Scatter plot of the rules: support vs. confidence, shaded by lift
plot(basket_rules, measure=c("support","confidence"), shading="lift")
We can see that buying coffee together with something else is
very popular. I would guess that the quality of the pastries matters a
lot for the bakery: if it is high and the products look appealing,
clients are likely to spend more money.
plot(basket_rules, method="grouped")
Good coffee is the key to success. If we provide the best
coffee experience in the city, people will visit us and
their baskets will contain more than just the brown liquid.
plot(basket_rules, method="graph")
Another type of rule visualization is the parallel coordinates plot,
which once again shows that good coffee is the key to success.
plot(basket_rules, method="paracoord", control=list(reorder=TRUE))
We can also generate an interactive representation of the
association rules. There are 13 generated rules, each with its
calculated statistics.
inspectDT(basket_rules)
Considering how much information we retrieved from just 4 columns of a
data frame, the result is very satisfying. Association rules are highly
beneficial for companies that sell products.
We were able to detect significant patterns. Providing this kind of
analysis to the stakeholders of the company can deliver impactful
insights and help to adjust marketing or sales campaigns.
Additionally, the visualizations are clear and easy to interpret.