Problem Statement
Market Basket Analysis is a method used by large retailers to discover product correlations. Purchase behavior can be well determined through constant checks on items that frequently appear together in transactions.
Addressing the Problem Statement
More about association rules and related terminologies can be found here.
Proposed Approach
Since the itemsets need to be built as individual objects of transaction type, Apriori algorithm approach is the correct way to go about this problem.
Benefits
The algorithm helps in mining frequent itemsets that are necessary to be able to produce relevant association rules between products. The users of this analysis, i.e. retail companies will henceforth be able to apply the outputs in the following situation:
Load Packages upfront
All the packages are loaded upfront to provide replication convenience to the readers.
# Load libraries
library(tidyverse)
library(lubridate)
library(arules)
library(arulesViz)
library(ggplot2)Messages / Warnings
All messages and warnings have been suppressed from package loading activity.
**Purpose of Packages
tidyverse: It is used for data manipulationarules: mine frequent itemsets and association rules and also generate transaction objectsarulesViz: visualize the association rules that have been generatedlubridate: operate with times and datesOriginal Source
The dataset was originally sourced from here
Data description
# read csv file
data <- read.csv("C:/Users/Center/Downloads/archive (1)/BreadBasket_DMS.csv", stringsAsFactors = TRUE)# check dataset structure for datatype
str(data)## 'data.frame': 21293 obs. of 4 variables:
## $ Date : Factor w/ 159 levels "2016-10-30","2016-10-31",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Time : Factor w/ 8240 levels "01:21:05","07:29:57",..: 1236 1345 1345 1379 1379 1379 1389 1466 1466 1466 ...
## $ Transaction: int 1 2 2 3 3 3 4 5 5 5 ...
## $ Item : Factor w/ 95 levels "Adjustment","Afternoon with the baker",..: 12 76 76 49 50 27 61 24 67 12 ...
The dataset contains real-world transaction data from a local grocery outlet. There are 6614 transactions and 104 items aggregated into categories.
The original sourcing of dataset was done to develop association rule package arules which provides a basic infrastructure for creating and manipulating inputs datasets and for analyzing the resulting itemsets and rules.
The dataset helped in the development of the following:
Both these algorithms are used to mine frequent itemsets, maximal frequent itemsets, closed frequent itemsets and association rules.
Data Assessment
The data was imported in data variable through read.csv() syntax.
Date. Character variable for the transactions date (YYYY-MM-DD format). The column includes dates from 29/10/2016 to 08/04/2017.
Time. Character variable that tells us the time of the transactions (HH:MM:SS format).
Transaction. Quantitative variable that allows us to differentiate the transactions. The rows that share the same value in this field belong to the same transaction, that’s why the data set has less transactions than observations.
Item. Character variable containing the products.
Looking at the summary, we notice that the Date and Time columns are of string datatype which calls for data type conversion.
# present dataset 5 point summary
summary(data)## Date Time Transaction Item
## 2017-02-04: 302 12:07:39: 16 Min. : 1 Coffee :5471
## 2016-11-05: 283 10:45:21: 13 1st Qu.:2548 Bread :3325
## 2017-03-04: 265 10:55:19: 13 Median :5067 Tea :1435
## 2017-03-25: 254 14:38:01: 13 Mean :4952 Cake :1025
## 2017-01-28: 243 13:43:08: 12 3rd Qu.:7329 Pastry : 856
## 2017-02-18: 240 14:19:47: 12 Max. :9684 NONE : 786
## (Other) :19706 (Other) :21214 (Other):8395
cat('\n')# check for any NA values column-wise
colSums(is.na(data))## Date Time Transaction Item
## 0 0 0 0
Even though there are no NULL values in the dataset, the Item sets contain NONE as a value which probably is because of measurement errors. Thus, the rows are dropped since it is insignificantly small when compared to the size of the actual itemsets. Also, the name of the Transaction is modified to TransactionID for operations.
# simplify column names
colnames(data) <- c('date', 'time', 'transactionID', 'item')
# change date and hour datatypes
data <- data %>% mutate(date = as.Date(date), time = hms(time))
# Drop NONE values
data$item[data$item == "NONE"] <- NA
data <- data %>% drop_na()
summary(data)## date time transactionID
## Min. :2016-10-30 Min. :1H 21M 5S Min. : 1
## 1st Qu.:2016-12-03 1st Qu.:10H 56M 44S 1st Qu.:2552
## Median :2017-01-22 Median :12H 38M 6S Median :5137
## Mean :2017-01-17 Mean :12H 46M 1.52235821914655S Mean :4976
## 3rd Qu.:2017-02-28 3rd Qu.:14H 32M 14.5S 3rd Qu.:7357
## Max. :2017-04-09 Max. :23H 38M 41S Max. :9684
##
## item
## Coffee :5471
## Bread :3325
## Tea :1435
## Cake :1025
## Pastry : 856
## Sandwich: 771
## (Other) :7624
# Calculate the difference in time between the dataset start and end dates
cat('\n')difftime(data$date[nrow(data)], data$date[1])## Time difference of 161 days
Final Dataset
Here’s the final cleaned dataset:
# provide overview of the cleaned dataset
head(data)Summary Information
Numerical summary is presented through 5 point summary statistics which suggests the following:
Uncovering new information
The data can be sliced and mutated to obtain relevant information through visualizations. dplyr package can help with data manipulation to provide useful insights through curated dataset. New variables are used to store the manipulated dataframes which are further used to create visualizations.
Histograms and Barcharts (horizontal and vertical) will be useful in representing most of the data as shown in the upcoming tabs.
To observe new information, we group the dataset based on variables and summarize the count of items. One such infomation is about the frequency of items across the transactionIDs.
# prepare visualization for identifying item frequency
viz1 <- data %>%
group_by(item) %>%
summarize(Count = n()) %>%
arrange(desc(Count)) %>%
slice(1:10) %>%
ggplot(aes(x = reorder(item, Count), y = Count, fill = item)) +
geom_bar(stat = 'identity') +
coord_flip() +
ggtitle('Most popular line items') +
theme(legend.position = "none")
viz1The top 10 Items based on counts in the transaction items where coffee is very high in demand.
Further, to understand the sales over time for the items, the dataset is grouped by date and mutated to represent count of items summarized over the dates across the entire dataset.
# prepare dataset for presenting sales over time
viz2 <- data %>%
group_by(date) %>%
summarise(Count = n()) %>%
mutate(Day = wday(date, label = T)) %>%
ggplot(aes(x = date, y = Count, fill = Day)) +
geom_bar(stat = "identity") +
ggtitle("Line items sold per day")
viz2Saturday is the biggest sales day followed by Sunday with the maximum line items being sold in February.
We can also calculate the Total unique transactions per weekday to establish the days of highest and lowest unique orders.
# calculate items
items <- data %>%
mutate(Day = wday(date, label = T)) %>%
group_by(Day) %>%
summarise(Count = n())
# calculate unique_transaction counts
unique_transactions <- data %>%
mutate(wday = wday(date, label = T)) %>%
group_by(wday, transactionID) %>%
summarise(n_distinct(transactionID)) %>%
summarise(Count = n())
# calculate the overall dataset containing Items / transaction
overall <- data.frame(items, unique_transactions[2], items[2] / unique_transactions[2])
colnames(overall) <- c("Day", "Line", "Unique", "Items.Trans")
# perform visualization for the overall dataset
ggplot(overall, aes(x = Day, y = Items.Trans, fill = Day)) +
geom_bar(stat = "identity") +
ggtitle("Total unique transactions per weekday") +
theme(legend.position = "none") +
geom_text(aes(label = round(Items.Trans, 1)), vjust = 2)Highest number of items per transaction ID indicates more sale on Sundays!
viz1 <- data %>%
mutate(Hour = as.factor(hour(time)))%>%
group_by(Hour) %>%
summarise(Count= n())
viz2 <- data %>%
mutate(Hour = as.factor(hour(time)))%>%
group_by(Hour, transactionID) %>%
summarise(n_distinct(transactionID)) %>%
summarise(Count=n())
generic_viz <- data.frame(viz1, viz2[2], viz1[2]/viz2[2]) # items per unique transaction
colnames(generic_viz) <- c("Hour", "Line", "Unique", "Items.Trans")
viz4 <-
ggplot(generic_viz,aes(x = Hour, y = Items.Trans, fill = Hour)) +
geom_bar(stat ="identity") +
ggtitle("Total items per transaction per hour") +
theme(legend.position = "none") +
geom_text(aes(label = round(Items.Trans, 1)), vjust = 2)
viz4The most items per transaction is made by people between 10am and 5pm with 2 pm as peak time
Apriori algorithm
The read.transactions() command reads the transaction csv data file from the url and creates a >transactions object.
The optimum support and confidence levels must be determined before creating a set of association rules. Most of the rules generated will be rendered useless if we set the values too low. We can either perform a grid search to identify optimum support and confidence or split them into range and test the total rules generated for each combination.
# load data as transactions object
data <- read.transactions("C:/Users/Center/Downloads/archive (1)/BreadBasket_DMS.csv", format = "single", cols = c(3,4), sep = ",", rm.duplicates = TRUE)
# Initial values
suplvl <- c(0.1, 0.05, 0.01, 0.005)
conflvl <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1)
# blank integers
rules_s10 <- integer(length = 9)
rules_s5 <- integer(length = 9)
rules_s1 <- integer(length = 9)
rules_s0.5 <- integer(length = 9)
# Apriori algorithm : support level of 10%
for (i in 1:length(conflvl)) {
rules_s10[i] <- length(apriori(data, parameter = list(sup = suplvl[1],
conf = conflvl[i], target = "rules")))
}
# Apriori algorithm : support level of 5%
for (i in 1:length(conflvl)){
rules_s5[i] <- length(apriori(data, parameter = list(sup = suplvl[2],
conf = conflvl[i], target = "rules")))
}
# Apriori algorithm : support level of 1%
for (i in 1:length(conflvl)){
rules_s1[i] <- length(apriori(data, parameter = list(sup = suplvl[3],
conf = conflvl[i], target = "rules")))
}
# Apriori algorithm : support level of 0.5%
for (i in 1:length(conflvl)){
rules_s0.5[i] <- length(apriori(data, parameter = list(sup = suplvl[4],
conf = conflvl[i], target = "rules")))
}The number of rules generated with a support level of 10%, 5%, 1% and 0.5% are shown as below
# Data frame
num_rules <- data.frame(rules_s10, rules_s5, rules_s1, rules_s0.5, conflvl)
# Count of rules found with a support level of 10%, 5%, 1% and 0.5%
ggplot(data=num_rules, aes(x=conflvl)) +
geom_line(aes(y=rules_s10, colour="Support level of 10%")) +
geom_point(aes(y=rules_s10, colour="Support level of 10%")) +
geom_line(aes(y=rules_s5, colour="Support level of 5%")) +
geom_point(aes(y=rules_s5, colour="Support level of 5%")) +
geom_line(aes(y=rules_s1, colour="Support level of 1%")) +
geom_point(aes(y=rules_s1, colour="Support level of 1%")) +
geom_line(aes(y=rules_s0.5, colour="Support level of 0.5%")) +
geom_point(aes(y=rules_s0.5, colour="Support level of 0.5%")) +
labs(x="Confidence levels", y="Count of rules found",
title="Apriori algorithm with different support levels")The results are as follows:
Support level of 10%. The rules are generated with very low confidence level.
Support level of 5%. We must look for support levels below 5% for rules with reasonable level of confidence.
Support level of 1%. 13 rules have a confidence of at least 50%.
Support level of 0.5%. Too many rules to analyze!
Thus, support level of 1% and a confidence level of 50%.
final_rule <- apriori(data, parameter = list(sup = suplvl[3],
conf = conflvl[5], target = "rules"))inspect(final_rule)## lhs rhs support confidence coverage lift
## [1] {Tiffin} => {Coffee} 0.01058361 0.5468750 0.01935289 1.134577
## [2] {Spanish Brunch} => {Coffee} 0.01406108 0.6326531 0.02222558 1.312537
## [3] {Scone} => {Coffee} 0.01844572 0.5422222 0.03401875 1.124924
## [4] {Toast} => {Coffee} 0.02570305 0.7296137 0.03522830 1.513697
## [5] {Alfajores} => {Coffee} 0.02237678 0.5522388 0.04052011 1.145705
## [6] {Juice} => {Coffee} 0.02131842 0.5300752 0.04021772 1.099723
## [7] {Hot chocolate} => {Coffee} 0.02721500 0.5263158 0.05170850 1.091924
## [8] {Medialuna} => {Coffee} 0.03296039 0.5751979 0.05730269 1.193337
## [9] {Cookies} => {Coffee} 0.02978530 0.5267380 0.05654672 1.092800
## [10] {NONE} => {Coffee} 0.04172966 0.5810526 0.07181736 1.205484
## [11] {Sandwich} => {Coffee} 0.04233444 0.5679513 0.07453886 1.178303
## [12] {Pastry} => {Coffee} 0.04868461 0.5590278 0.08708800 1.159790
## [13] {Cake} => {Coffee} 0.05654672 0.5389049 0.10492894 1.118042
## count
## [1] 70
## [2] 93
## [3] 122
## [4] 170
## [5] 148
## [6] 141
## [7] 180
## [8] 218
## [9] 197
## [10] 276
## [11] 280
## [12] 322
## [13] 374
The rules can be interpreted as follows:
52% of the customers who bought a hot chocolate also bought a coffee.
63% of the customers who bought a spanish brunch also bought a coffee.
73% of the customers who bought a toast also bought a coffee.
plot(final_rule, measure = c("support", "lift"), shading = "confidence")# Graph (circular layout)
plot(final_rule, method = "graph", control = list(layout = igraph::in_circle()))Following is the summary of analysis obtained throughout the project!
Saturday is the biggest sales day followed by Sunday with the maximum line items being sold in February.
The top 10 Items based on counts in the transaction items where coffee is very high in demand.
Highest number of items per transaction ID indicates more sale on Sundays
The most items per transaction is made by people between 10am and 5pm with 2 pm as peak time
Support level of 1% and a confidence level of 50%.
52% of the customers who bought a hot chocolate also bought a coffee.
63% of the customers who bought a spanish brunch also bought a coffee.
73% of the customers who bought a toast also bought a coffee.
Further advanced visualizations can be used to develop targeted marketing approach for customers based on their age and buying propensity.
Hadley Wickham (2017). tidyverse: Easily Install and Load the ‘Tidyverse’. R package version 1.2.1. https://CRAN.R-project.org/package=tidyverse
Michael Hahsler, Bettina Gruen and Kurt Hornik (2005), arules - A Computational Environment for Mining Association Rules and Frequent Item Sets. Journal of Statistical Software 14/15. URL: http://dx.doi.org/10.18637/jss.v014.i15.