1 Introduction

Problem Statement

Market Basket Analysis is a method used by large retailers to discover product correlations. Purchase behavior can be well determined through constant checks on items that frequently appear together in transactions.

Addressing the Problem Statement

Algorithm: Apriori algorithm helps in conducting this kind of analysis. It generates association rules that imply if an item X occurs, then item Y also occurs with a certain probability.

More about association rules and related terminologies can be found here.

Data: The dataset can be accessed from here.

Proposed Approach

Since the itemsets need to be built as individual objects of transaction type, Apriori algorithm approach is the correct way to go about this problem.

Benefits

The algorithm helps in mining frequent itemsets that are necessary to be able to produce relevant association rules between products. The users of this analysis, i.e. retail companies will henceforth be able to apply the outputs in the following situation:

Where should an item be placed in a store to maximize sales?
Which items are frequently bought together?
Does brand of an item make a difference when bought with another item?
How does demographics affect the purchase behavior of customers?

2 Packages Required

Load Packages upfront

All the packages are loaded upfront to provide replication convenience to the readers.

# Load libraries
library(tidyverse) 
library(lubridate) 
library(arules) 
library(arulesViz)
library(ggplot2)

Messages / Warnings

All messages and warnings have been suppressed from package loading activity.

**Purpose of Packages

tidyverse: It is used for data manipulation
arules: mine frequent itemsets and association rules and also generate transaction objects
arulesViz: visualize the association rules that have been generated
lubridate: operate with times and dates

3 Data Preparation

Original Source

The dataset was originally sourced from here

Data description

# read csv file
data <- read.csv("C:/Users/Center/Downloads/archive (1)/BreadBasket_DMS.csv", stringsAsFactors = TRUE)

# check dataset structure for datatype
str(data)

## 'data.frame':    21293 obs. of  4 variables:
##  $ Date       : Factor w/ 159 levels "2016-10-30","2016-10-31",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Time       : Factor w/ 8240 levels "01:21:05","07:29:57",..: 1236 1345 1345 1379 1379 1379 1389 1466 1466 1466 ...
##  $ Transaction: int  1 2 2 3 3 3 4 5 5 5 ...
##  $ Item       : Factor w/ 95 levels "Adjustment","Afternoon with the baker",..: 12 76 76 49 50 27 61 24 67 12 ...

The dataset contains real-world transaction data from a local grocery outlet. There are 6614 transactions and 104 items aggregated into categories.

Date
Time
Transaction
Item

The original sourcing of dataset was done to develop association rule package arules which provides a basic infrastructure for creating and manipulating inputs datasets and for analyzing the resulting itemsets and rules.

The dataset helped in the development of the following:

Apriori algorithm
Eclat algorithm

Both these algorithms are used to mine frequent itemsets, maximal frequent itemsets, closed frequent itemsets and association rules.

Data Assessment

The data was imported in data variable through read.csv() syntax.

Date. Character variable for the transactions date (YYYY-MM-DD format). The column includes dates from 29/10/2016 to 08/04/2017.
Time. Character variable that tells us the time of the transactions (HH:MM:SS format).
Transaction. Quantitative variable that allows us to differentiate the transactions. The rows that share the same value in this field belong to the same transaction, that’s why the data set has less transactions than observations.
Item. Character variable containing the products.

Looking at the summary, we notice that the Date and Time columns are of string datatype which calls for data type conversion.

# present dataset 5 point summary
summary(data)

##          Date             Time        Transaction        Item     
##  2017-02-04:  302   12:07:39:   16   Min.   :   1   Coffee :5471  
##  2016-11-05:  283   10:45:21:   13   1st Qu.:2548   Bread  :3325  
##  2017-03-04:  265   10:55:19:   13   Median :5067   Tea    :1435  
##  2017-03-25:  254   14:38:01:   13   Mean   :4952   Cake   :1025  
##  2017-01-28:  243   13:43:08:   12   3rd Qu.:7329   Pastry : 856  
##  2017-02-18:  240   14:19:47:   12   Max.   :9684   NONE   : 786  
##  (Other)   :19706   (Other) :21214                  (Other):8395

cat('\n')

# check for any NA values column-wise
colSums(is.na(data))

##        Date        Time Transaction        Item 
##           0           0           0           0

Even though there are no NULL values in the dataset, the Item sets contain NONE as a value which probably is because of measurement errors. Thus, the rows are dropped since it is insignificantly small when compared to the size of the actual itemsets. Also, the name of the Transaction is modified to TransactionID for operations.

# simplify column names
colnames(data) <- c('date', 'time', 'transactionID', 'item')

# change date and hour datatypes
data <- data %>% mutate(date = as.Date(date), time = hms(time))

# Drop NONE values
data$item[data$item == "NONE"] <- NA
data <- data %>% drop_na()
summary(data)

##       date                 time                           transactionID 
##  Min.   :2016-10-30   Min.   :1H 21M 5S                   Min.   :   1  
##  1st Qu.:2016-12-03   1st Qu.:10H 56M 44S                 1st Qu.:2552  
##  Median :2017-01-22   Median :12H 38M 6S                  Median :5137  
##  Mean   :2017-01-17   Mean   :12H 46M 1.52235821914655S   Mean   :4976  
##  3rd Qu.:2017-02-28   3rd Qu.:14H 32M 14.5S               3rd Qu.:7357  
##  Max.   :2017-04-09   Max.   :23H 38M 41S                 Max.   :9684  
##                                                                         
##        item     
##  Coffee  :5471  
##  Bread   :3325  
##  Tea     :1435  
##  Cake    :1025  
##  Pastry  : 856  
##  Sandwich: 771  
##  (Other) :7624

# Calculate the difference in time between the dataset start and end dates
cat('\n')

difftime(data$date[nrow(data)], data$date[1])

## Time difference of 161 days

Final Dataset

Here’s the final cleaned dataset:

# provide overview of the cleaned dataset
head(data)

Summary Information

Numerical summary is presented through 5 point summary statistics which suggests the following:

The entire dataset is collected for 161 days
Most sales happen at 12.45 pm in the afternoon
Coffee is the most ordered item!

4 Proposed Exploratory Data Analysis

Uncovering new information

The data can be sliced and mutated to obtain relevant information through visualizations. dplyr package can help with data manipulation to provide useful insights through curated dataset. New variables are used to store the manipulated dataframes which are further used to create visualizations.

4.1 Plots and Tables

Histograms and Barcharts (horizontal and vertical) will be useful in representing most of the data as shown in the upcoming tabs.

4.1.1 Item Frequency

To observe new information, we group the dataset based on variables and summarize the count of items. One such infomation is about the frequency of items across the transactionIDs.

# prepare visualization for identifying item frequency
viz1 <- data %>%
  group_by(item) %>%
  summarize(Count = n()) %>%
  arrange(desc(Count)) %>%
  slice(1:10) %>%
  ggplot(aes(x = reorder(item, Count), y = Count, fill = item)) + 
  geom_bar(stat = 'identity') + 
  coord_flip() + 
  ggtitle('Most popular line items') + 
  theme(legend.position = "none")

viz1

The top 10 Items based on counts in the transaction items where coffee is very high in demand.

4.1.2 Sales over time

Further, to understand the sales over time for the items, the dataset is grouped by date and mutated to represent count of items summarized over the dates across the entire dataset.

# prepare dataset for presenting sales over time
viz2 <- data %>%
  group_by(date) %>%
  summarise(Count = n()) %>%
  mutate(Day = wday(date, label = T)) %>% 
  ggplot(aes(x = date, y = Count, fill = Day)) +
  geom_bar(stat = "identity") +
  ggtitle("Line items sold per day")
viz2

Saturday is the biggest sales day followed by Sunday with the maximum line items being sold in February.

4.1.3 Total Unique transactions per weekday

We can also calculate the Total unique transactions per weekday to establish the days of highest and lowest unique orders.

# calculate items
items <- data %>%
  mutate(Day = wday(date, label = T)) %>%
  group_by(Day) %>%
  summarise(Count = n())

# calculate unique_transaction counts
unique_transactions <- data %>%
  mutate(wday = wday(date, label = T)) %>%
  group_by(wday, transactionID) %>%
  summarise(n_distinct(transactionID)) %>%
  summarise(Count = n())

# calculate the overall dataset containing Items / transaction
overall <- data.frame(items, unique_transactions[2], items[2] / unique_transactions[2])
colnames(overall) <- c("Day", "Line", "Unique", "Items.Trans")

# perform visualization for the overall dataset
ggplot(overall, aes(x = Day, y = Items.Trans, fill = Day)) +
  geom_bar(stat = "identity") +
  ggtitle("Total unique transactions per weekday") +
  theme(legend.position = "none") +
  geom_text(aes(label = round(Items.Trans, 1)), vjust = 2)

Highest number of items per transaction ID indicates more sale on Sundays!

4.1.4 Hourly Sales

viz1 <- data %>% 
  mutate(Hour = as.factor(hour(time)))%>% 
  group_by(Hour) %>% 
  summarise(Count= n()) 

viz2 <- data %>% 
  mutate(Hour = as.factor(hour(time)))%>% 
  group_by(Hour, transactionID) %>% 
  summarise(n_distinct(transactionID)) %>% 
  summarise(Count=n())

generic_viz <- data.frame(viz1, viz2[2], viz1[2]/viz2[2])  # items per unique transaction
colnames(generic_viz) <- c("Hour", "Line", "Unique", "Items.Trans")

viz4 <- 
  ggplot(generic_viz,aes(x = Hour, y = Items.Trans, fill = Hour)) +
  geom_bar(stat ="identity") +
  ggtitle("Total items per transaction per hour") +
  theme(legend.position = "none") +
  geom_text(aes(label = round(Items.Trans, 1)), vjust = 2)
viz4

The most items per transaction is made by people between 10am and 5pm with 2 pm as peak time

5 Data Modeling

Apriori algorithm

5.1 Selecting support and confidence levels

The read.transactions() command reads the transaction csv data file from the url and creates a >transactions object.

The optimum support and confidence levels must be determined before creating a set of association rules. Most of the rules generated will be rendered useless if we set the values too low. We can either perform a grid search to identify optimum support and confidence or split them into range and test the total rules generated for each combination.

# load data as transactions object
data <- read.transactions("C:/Users/Center/Downloads/archive (1)/BreadBasket_DMS.csv", format = "single", cols = c(3,4), sep = ",", rm.duplicates = TRUE)

# Initial values
suplvl <- c(0.1, 0.05, 0.01, 0.005)
conflvl <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1)

# blank integers 
rules_s10 <- integer(length = 9)
rules_s5 <- integer(length = 9)
rules_s1 <- integer(length = 9)
rules_s0.5 <- integer(length = 9)

# Apriori algorithm : support level of 10%
for (i in 1:length(conflvl)) {
  rules_s10[i] <- length(apriori(data, parameter = list(sup = suplvl[1], 
                                   conf = conflvl[i], target = "rules")))
}

# Apriori algorithm : support level of 5%
for (i in 1:length(conflvl)){
  rules_s5[i] <- length(apriori(data, parameter = list(sup = suplvl[2], 
                                  conf = conflvl[i], target = "rules")))
}

# Apriori algorithm : support level of 1%
for (i in 1:length(conflvl)){
  rules_s1[i] <- length(apriori(data, parameter = list(sup = suplvl[3], 
                                  conf = conflvl[i], target = "rules")))
}

# Apriori algorithm : support level of 0.5%
for (i in 1:length(conflvl)){
  rules_s0.5[i] <- length(apriori(data, parameter = list(sup = suplvl[4], 
                                    conf = conflvl[i], target = "rules")))
}

The number of rules generated with a support level of 10%, 5%, 1% and 0.5% are shown as below

# Data frame
num_rules <- data.frame(rules_s10, rules_s5, rules_s1, rules_s0.5, conflvl)

# Count of rules found with a support level of 10%, 5%, 1% and 0.5%
ggplot(data=num_rules, aes(x=conflvl)) +
  geom_line(aes(y=rules_s10, colour="Support level of 10%")) + 
  geom_point(aes(y=rules_s10, colour="Support level of 10%")) +
  geom_line(aes(y=rules_s5, colour="Support level of 5%")) +
  geom_point(aes(y=rules_s5, colour="Support level of 5%")) +
  geom_line(aes(y=rules_s1, colour="Support level of 1%")) + 
  geom_point(aes(y=rules_s1, colour="Support level of 1%")) +
  geom_line(aes(y=rules_s0.5, colour="Support level of 0.5%")) +
  geom_point(aes(y=rules_s0.5, colour="Support level of 0.5%")) +
  labs(x="Confidence levels", y="Count of rules found", 
       title="Apriori algorithm with different support levels")

The results are as follows:

Support level of 10%. The rules are generated with very low confidence level.
Support level of 5%. We must look for support levels below 5% for rules with reasonable level of confidence.
Support level of 1%. 13 rules have a confidence of at least 50%.
Support level of 0.5%. Too many rules to analyze!

Thus, support level of 1% and a confidence level of 50%.

5.2 Finalization

final_rule <- apriori(data, parameter = list(sup = suplvl[3], 
                             conf = conflvl[5], target = "rules"))

inspect(final_rule)

##      lhs                 rhs      support    confidence coverage   lift    
## [1]  {Tiffin}         => {Coffee} 0.01058361 0.5468750  0.01935289 1.134577
## [2]  {Spanish Brunch} => {Coffee} 0.01406108 0.6326531  0.02222558 1.312537
## [3]  {Scone}          => {Coffee} 0.01844572 0.5422222  0.03401875 1.124924
## [4]  {Toast}          => {Coffee} 0.02570305 0.7296137  0.03522830 1.513697
## [5]  {Alfajores}      => {Coffee} 0.02237678 0.5522388  0.04052011 1.145705
## [6]  {Juice}          => {Coffee} 0.02131842 0.5300752  0.04021772 1.099723
## [7]  {Hot chocolate}  => {Coffee} 0.02721500 0.5263158  0.05170850 1.091924
## [8]  {Medialuna}      => {Coffee} 0.03296039 0.5751979  0.05730269 1.193337
## [9]  {Cookies}        => {Coffee} 0.02978530 0.5267380  0.05654672 1.092800
## [10] {NONE}           => {Coffee} 0.04172966 0.5810526  0.07181736 1.205484
## [11] {Sandwich}       => {Coffee} 0.04233444 0.5679513  0.07453886 1.178303
## [12] {Pastry}         => {Coffee} 0.04868461 0.5590278  0.08708800 1.159790
## [13] {Cake}           => {Coffee} 0.05654672 0.5389049  0.10492894 1.118042
##      count
## [1]   70  
## [2]   93  
## [3]  122  
## [4]  170  
## [5]  148  
## [6]  141  
## [7]  180  
## [8]  218  
## [9]  197  
## [10] 276  
## [11] 280  
## [12] 322  
## [13] 374

The rules can be interpreted as follows:

52% of the customers who bought a hot chocolate also bought a coffee.
63% of the customers who bought a spanish brunch also bought a coffee.
73% of the customers who bought a toast also bought a coffee.

5.3 Visualizatons of association rules

plot(final_rule, measure = c("support", "lift"), shading = "confidence")

# Graph (circular layout)
plot(final_rule, method = "graph", control = list(layout = igraph::in_circle()))

6 Summary

Following is the summary of analysis obtained throughout the project!

Saturday is the biggest sales day followed by Sunday with the maximum line items being sold in February.
The top 10 Items based on counts in the transaction items where coffee is very high in demand.
Highest number of items per transaction ID indicates more sale on Sundays
The most items per transaction is made by people between 10am and 5pm with 2 pm as peak time
Support level of 1% and a confidence level of 50%.
52% of the customers who bought a hot chocolate also bought a coffee.
63% of the customers who bought a spanish brunch also bought a coffee.
73% of the customers who bought a toast also bought a coffee.

Further advanced visualizations can be used to develop targeted marketing approach for customers based on their age and buying propensity.

7 References

Hadley Wickham (2017). tidyverse: Easily Install and Load the ‘Tidyverse’. R package version 1.2.1. https://CRAN.R-project.org/package=tidyverse

Michael Hahsler, Bettina Gruen and Kurt Hornik (2005), arules - A Computational Environment for Mining Association Rules and Frequent Item Sets. Journal of Statistical Software 14/15. URL: http://dx.doi.org/10.18637/jss.v014.i15.

Market Basket Analysis in R

Durgesh Mishra

2021-08-06