Introduction

Over the last year, retail optimization has gained a lot of attention in the data science community. The COVID-19 pandemic brought restrictions that changed customer behavior, and the retail industry was forced to close physical stores. “The Next Normal” requires adaptation to new requirements, and the winners are those retailers that prioritize online retail. For instance, in the United States, e-commerce availability and hygiene caused 17 percent of consumers to leave their primary store. One American retailer, Instacart, has already addressed the issue of optimizing e-commerce and released a prediction competition on Kaggle with several data sets (“Instacart Market Basket Analysis” 2021). Instacart is an American online retailer that uses advanced analytics algorithms to provide the best experience for its customers.

Dataset preprocessing

Instacart Market Basket loading

First, all the data need to be loaded into the R environment.

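# Packages used throughout: dplyr (pipes, joins), knitr (kable), arules and arulesViz (rule mining, plots)
library(dplyr)
library(knitr)
library(arules)
library(arulesViz)

aisles <- read.csv("data/aisles.csv", sep = ",")
departments <- read.csv("data/departments.csv", sep = ",")
order_products_train <- read.csv("data/order_products__train.csv", sep = ",")
orders <- read.csv("data/orders.csv", sep = ",")
products <- read.csv("data/products.csv", sep = ",")
random_names <- read.csv("data/random_names.csv", sep = ",")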

Then, to reduce the number of rows, a sample of the order products is taken. We take 50% of the transactions, which still leaves more than half a million transactions.

set.seed(0402)
trans <-
  order_products_train %>%
  sample_frac(size = 0.5) %>%
  inner_join(products, by = "product_id", keep = FALSE) %>%
  inner_join(aisles, by = "aisle_id", keep = FALSE) %>%
  inner_join(orders, by = "order_id", keep = FALSE) %>%
  inner_join(departments, by ="department_id", keep = FALSE)

Preprocessing

The transactional data set is not easy to handle in its current format: it contains many ids that are hard to interpret. Several transformations are therefore performed to obtain interpretable transactional data.

kable_head <- function(df){
  df %>% kable() %>% head()
}

day_of_week <-
  tibble(day_id = 0:6,
         dow_name = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
         )

day_of_week %>% kable_head


day_id	dow_name
0	Monday
1	Tuesday
2	Wednesday
3	Thursday
4	Friday
5	Saturday
6	Sunday

reordered_df <-
  tibble(reordered_id = 0:1,
         reorder_name = c("Not reordered", "Reordered")
         )

reordered_df %>% kable_head


reordered_id	reorder_name
0	Not reordered
1	Reordered

order_hour_of_day <-
  tibble(
       hour_id = 0:23,
       time_of_day = c(rep("Night",6), rep("Morning", 4), rep("Senior hours", 2),
                    rep("Noon", 2), rep("Afternoon", 4), rep("Early night",4), rep("Night", 2))
       )

order_hour_of_day %>% kable_head


hour_id	time_of_day
0	Night
1	Night
2	Night
3	Night
4	Night
5	Night
6	Morning
7	Morning
8	Morning
9	Morning
10	Senior hours
11	Senior hours
12	Noon
13	Noon
14	Afternoon
15	Afternoon
16	Afternoon
17	Afternoon
18	Early night
19	Early night
20	Early night
21	Early night
22	Night
23	Night

Three mapping tables have been created above. The first two are self-explanatory; for the hour of the day, six time periods have been assigned for further analysis.
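A quick way to sanity-check the hour mapping is to rebuild it in base R and count the hours per label. This is only a minimal sketch; the names `hour_map` and `hours_per_label` are introduced here for illustration and are not used elsewhere in the analysis.

```r
# Rebuild the hour-to-period mapping with base R (illustrative name: hour_map)
hour_map <- data.frame(
  hour_id = 0:23,
  time_of_day = c(rep("Night", 6), rep("Morning", 4), rep("Senior hours", 2),
                  rep("Noon", 2), rep("Afternoon", 4), rep("Early night", 4),
                  rep("Night", 2)),
  stringsAsFactors = FALSE
)

# Every hour of the day must be covered exactly once
stopifnot(nrow(hour_map) == 24, !anyDuplicated(hour_map$hour_id))

# Number of hours assigned to each label; "Night" covers hours 0-5 and 22-23
hours_per_label <- table(hour_map$time_of_day)
print(hours_per_label)
```

Because "Night" is used for both the late evening and the early morning, the 24 hours collapse into six distinct labels.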

trans_for_ar <-
  trans %>%
  inner_join(day_of_week, by = c("order_dow" = "day_id")) %>%
  inner_join(reordered_df, by = c("reordered" = "reordered_id")) %>%
  inner_join(order_hour_of_day, by = c("order_hour_of_day" = "hour_id")) %>%
  inner_join(random_names, by = c("user_id"= "user_id")) %>%
  select(-aisle_id, -department_id, -eval_set,
         -product_id, -order_dow, - reordered,
         -order_hour_of_day, -days_since_prior_order,
         -order_number, -add_to_cart_order, -user_id) %>%
  arrange(order_id, user_name)

The transactions are combined into one big data frame and the id columns are removed. The eval_set column is also redundant for the analysis. Since the chosen data set is very large and only a sample of it is used, the column days_since_prior_order does not make much sense, because not every user has order number 1 in the sample. Therefore this column is omitted, as well as order_number.

trans <- as(trans_for_ar, "transactions")
trans_for_viz <- as(trans_for_ar %>% select(product_name, aisle, department), "transactions")

The final transactional dataframe is made of the following attributes:

order_id - needed for a unique transaction specification
product_name - full product name
aisle - aisle name that identifies, where the product belongs inside the store
department - there are 21 department names in the store
dow_name - identifies a day of week when an order has been made
reorder_name - whether the product has been reordered or not
time_of_day - artificially created time of the day based on column value specification
user_name - artificial user name assigned for easier identification than original user_id

Association rules

General view

Association rules are often used to analyze large transaction databases in which each row is coded by binary attributes. They require neither training nor labels prepared beforehand. The most common applications are market basket analysis, discovering interesting patterns in DNA and protein sequences, and finding common patterns of behavior that precede customers dropping their cell phone operator.

In transactional data analysis, association rules and frequent itemset mining play an important role in retail basket analysis. The technique is especially useful for mining patterns in large databases. The most frequently used concepts are frequent itemsets, maximal frequent itemsets, closed frequent itemsets, and association rules.

Definition:

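Let \(I = \{i_1, i_2, ..., i_n \}\) be a set of n binary attributes called items. Let \(D = \{t_1, t_2, ..., t_m \}\) be a set of transactions called the database. Each transaction in D has a unique transaction ID and contains a subset of the items in I. A rule is defined as an implication of the form \(X \Rightarrow Y\), where \(X, Y \subseteq I\) and \(X \cap Y = \emptyset\). The sets of items X and Y are called the antecedent (left-hand side, LHS) and the consequent (right-hand side, RHS) of the rule.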

The support \(supp(X)\) of an itemset X is defined as the ratio of the number of transactions that contain X to the number of transactions in the whole data set.

The confidence of a rule is defined as \(conf(X \Rightarrow Y) = \frac{supp(X \cup Y)}{supp(X)}\).

Another practical measure, used to address the issue of finding too many rules, is lift, which is defined as \(lift(X \Rightarrow Y)=\frac{supp(X \cup Y)}{supp(X)supp(Y)}\). It can be interpreted as the deviation of the support of the whole rule from the support expected under independence given the supports of the LHS and the RHS (Hahsler, Grün, and Hornik 2005).
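The three measures can be computed by hand on a tiny example. The sketch below uses base R only; the toy database `db`, its item names, and the helper `supp()` are made up for illustration and are not part of the Instacart data or of the arules API.

```r
# Toy database: each row is a transaction, each column a binary item
# (item names and values are made up for illustration)
db <- matrix(
  c(1, 1, 0,
    1, 1, 1,
    1, 0, 0,
    0, 1, 1,
    1, 1, 0),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("bread", "butter", "milk"))
) == 1

# supp(X): share of transactions that contain every item in X
supp <- function(items) {
  mean(rowSums(db[, items, drop = FALSE]) == length(items))
}

supp_x  <- supp("bread")                 # 4 of 5 transactions
supp_y  <- supp("butter")                # 4 of 5 transactions
supp_xy <- supp(c("bread", "butter"))    # 3 of 5 transactions

conf_xy <- supp_xy / supp_x              # confidence of {bread} => {butter}
lift_xy <- supp_xy / (supp_x * supp_y)   # lift of {bread} => {butter}

print(c(confidence = conf_xy, lift = lift_xy))
```

Here the confidence is 0.75 and the lift is 0.9375. A lift below 1 means that bread and butter appear together slightly less often than would be expected if they were independent.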

Associations mining

To investigate the nature of the data set, the read.transactions function has been used. The transformations produce a transaction matrix with 565097 rows and 132889 columns. Not surprisingly, the most frequent items come from the binary feature and from the transaction identifier, order_id. There are no missing values in the database.

Rules found in the dataset

kableRules <- function(rules, sortable){
  rules.sorted<-sort(rules, by=sortable, decreasing=TRUE)
  rules.sorted.df <- as(head(rules.sorted), "data.frame")
  rownames(rules.sorted.df) <- NULL
  rules.sorted.df %>% 
    as_tibble() %>% 
    mutate(across(.cols = c(support, confidence, coverage, lift, count), 
                  ~round(., 4))) %>%
    kable()
}

rules.trans<-apriori(trans, parameter=list(supp=0.1, conf=0.5),
                     control=list(verbose=FALSE))

Rules are mined with two thresholds: a minimum support of 0.1 and a minimum confidence of 0.5.
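To make the effect of the two thresholds concrete, the sketch below applies them by hand to three rules whose measures are copied from the result tables in this article. It is plain base R and does not reproduce how apriori prunes its search internally; the data frame `candidates` is a hypothetical helper.

```r
n_transactions <- 565097   # number of rows reported for the transaction matrix
min_support    <- 0.1
min_confidence <- 0.5

# A relative support threshold corresponds to roughly this many transactions
min_count <- min_support * n_transactions
print(min_count)

# Three rules with measures copied from the result tables in this article
candidates <- data.frame(
  rule = c("{dow_name=Monday} => {reorder_name=Reordered}",
           "{time_of_day=Afternoon} => {reorder_name=Reordered}",
           "{aisle=coffee} => {time_of_day=Senior hours}"),
  support = c(0.1429, 0.1889, 0.0012),
  confidence = c(0.6093, 0.5882, 0.1897),
  stringsAsFactors = FALSE
)

# Keep only the rules that satisfy both thresholds
passed <- candidates[candidates$support >= min_support &
                       candidates$confidence >= min_confidence, ]
print(passed$rule)
```

The coffee rule fails both thresholds, which is why the questions further on lower the support and confidence limits considerably.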

kableRules(rules.trans, "confidence")

rules	support	confidence	coverage	lift	count
{aisle=fresh vegetables} => {department=produce}	0.1084	1.0000	0.1084	3.3928	61246
{aisle=fresh fruits} => {department=produce}	0.1085	1.0000	0.1085	3.3928	61332
{department=dairy eggs} => {reorder_name=Reordered}	0.1061	0.6752	0.1572	1.1267	59976
{department=produce} => {reorder_name=Reordered}	0.1961	0.6654	0.2947	1.1104	110824
{dow_name=Monday} => {reorder_name=Reordered}	0.1429	0.6093	0.2345	1.0168	80742
{order_id=[2.28e+06,3.42e+06]} => {reorder_name=Reordered}	0.2008	0.6025	0.3333	1.0054	113483

All the found associations are pretty obvious.

kableRules(rules.trans, "lift")

rules	support	confidence	coverage	lift	count
{aisle=fresh vegetables} => {department=produce}	0.1084	1.0000	0.1084	3.3928	61246
{aisle=fresh fruits} => {department=produce}	0.1085	1.0000	0.1085	3.3928	61332
{department=dairy eggs} => {reorder_name=Reordered}	0.1061	0.6752	0.1572	1.1267	59976
{department=produce} => {reorder_name=Reordered}	0.1961	0.6654	0.2947	1.1104	110824
{dow_name=Monday} => {reorder_name=Reordered}	0.1429	0.6093	0.2345	1.0168	80742
{order_id=[2.28e+06,3.42e+06]} => {reorder_name=Reordered}	0.2008	0.6025	0.3333	1.0054	113483

What is definitely interesting is the dependency between reordering and shopping for fresh food: both dairy eggs and produce imply a reorder with a lift above 1.1.

kableRules(rules.trans, "count")

rules	support	confidence	coverage	lift	count
{} => {reorder_name=Reordered}	0.5992	0.5992	1.0000	1.0000	338631
{order_id=[2.28e+06,3.42e+06]} => {reorder_name=Reordered}	0.2008	0.6025	0.3333	1.0054	113483
{order_id=[1,1.13e+06)} => {reorder_name=Reordered}	0.1993	0.5978	0.3333	0.9976	112602
{order_id=[1.13e+06,2.28e+06)} => {reorder_name=Reordered}	0.1992	0.5975	0.3333	0.9971	112546
{department=produce} => {reorder_name=Reordered}	0.1961	0.6654	0.2947	1.1104	110824
{time_of_day=Afternoon} => {reorder_name=Reordered}	0.1889	0.5882	0.3211	0.9816	106731

Not surprisingly, order_id and reorder_name are placed high. The counts suggest that the afternoon is the most common time of day for shopping.

Questions

Question 1

What is the profile of transactions that are being made during Senior hours?

rules.Senior.hours<-apriori(data=trans,
                            parameter=list(supp=0.001, conf=0.08),
                            appearance=list(default="lhs", rhs= "time_of_day=Senior hours"),
                            control=list(verbose=F))
kableRules(rules.Senior.hours, "confidence")

rules	support	confidence	coverage	lift	count
{department=beverages,dow_name=Tuesday,reorder_name=Reordered} => {time_of_day=Senior hours}	0.0017	0.2023	0.0084	1.2501	965
{department=snacks,dow_name=Tuesday,reorder_name=Reordered} => {time_of_day=Senior hours}	0.0016	0.2012	0.0078	1.2431	889
{department=beverages,dow_name=Tuesday} => {time_of_day=Senior hours}	0.0024	0.1921	0.0124	1.1873	1351
{department=household,dow_name=Monday} => {time_of_day=Senior hours}	0.0010	0.1902	0.0054	1.1750	583
{aisle=coffee} => {time_of_day=Senior hours}	0.0012	0.1897	0.0061	1.1721	651
{aisle=coffee,department=beverages} => {time_of_day=Senior hours}	0.0012	0.1897	0.0061	1.1721	651

As the documentation states, a rule with an empty left-hand side ({}) means that, no matter which other items have been chosen, the item on the RHS will be chosen with a confidence equal to its support.
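The statement about the empty left-hand side can be verified with simple arithmetic. The sketch below uses the count reported for the rule {} => {reorder_name=Reordered} in the table sorted by count and the total number of transactions; the variable names are illustrative.

```r
# Values reported for the rule {} => {reorder_name=Reordered}
count_rhs      <- 338631
n_transactions <- 565097

support_rhs <- count_rhs / n_transactions  # share of rows marked as reordered
coverage    <- 1                           # the empty antecedent matches every transaction
confidence  <- support_rhs / coverage      # therefore equal to the support
lift        <- confidence / support_rhs    # therefore exactly 1

print(round(c(support = support_rhs, confidence = confidence, lift = lift), 4))
```

The rounded values agree with the table row: support 0.5992, confidence 0.5992, and lift 1.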

Question 2

What is the profile of transactions that are being made during Senior hours?

rules.Senior.hours<-apriori(data=trans,
                            parameter=list(supp=0.001, conf=0.08),
                            appearance=list(default="lhs", rhs= "time_of_day=Senior hours"),
                            control=list(verbose=F))
kableRules(rules.Senior.hours, "lift")

rules	support	confidence	coverage	lift	count
{department=beverages,dow_name=Tuesday,reorder_name=Reordered} => {time_of_day=Senior hours}	0.0017	0.2023	0.0084	1.2501	965
{department=snacks,dow_name=Tuesday,reorder_name=Reordered} => {time_of_day=Senior hours}	0.0016	0.2012	0.0078	1.2431	889
{department=beverages,dow_name=Tuesday} => {time_of_day=Senior hours}	0.0024	0.1921	0.0124	1.1873	1351
{department=household,dow_name=Monday} => {time_of_day=Senior hours}	0.0010	0.1902	0.0054	1.1750	583
{aisle=coffee} => {time_of_day=Senior hours}	0.0012	0.1897	0.0061	1.1721	651
{aisle=coffee,department=beverages} => {time_of_day=Senior hours}	0.0012	0.1897	0.0061	1.1721	651

We can clearly see that, for senior hours, ordering by confidence and ordering by lift give the same result. This is expected: the consequent is fixed, so lift is simply confidence divided by the constant supp(Y). Given these metrics, we see that beverages are chosen most often and that people tend to use Senior hours on Tuesdays. Others need additional caffeine. No aisles such as “eye ear care” or “digestion” appear among the top rules, although their presence would have suggested that these hours are often used by people with serious health problems. Let us now check the profile of such health-related categories with respect to hours.
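The agreement between the two orderings follows from the definitions: with a fixed consequent Y, lift equals confidence divided by supp(Y). The sketch below checks this on the six rows of the Senior hours table; the vectors are copied from that table and the variable names are illustrative.

```r
# Confidence and lift of the six rules with RHS {time_of_day=Senior hours}
conf <- c(0.2023, 0.2012, 0.1921, 0.1902, 0.1897, 0.1897)
lift <- c(1.2501, 1.2431, 1.1873, 1.1750, 1.1721, 1.1721)

# With a fixed consequent Y, conf / lift should recover supp(Y) for every rule
implied_supp_y <- conf / lift
print(round(implied_supp_y, 4))

# Dividing by a positive constant cannot change the ranking
same_order <- identical(order(conf, decreasing = TRUE),
                        order(lift, decreasing = TRUE))
print(same_order)
```

Every ratio is close to 0.1618, which matches the coverage reported for {time_of_day=Senior hours} in the Question 4 table sorted by count.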

rules.muscles.joints.pain.relief<-apriori(data=trans,
                                          parameter=list(supp=0.0001, conf=0.000001),
                                          appearance=list(default="lhs",
                                                          rhs= "aisle=muscles joints pain relief"),
                                          control=list(verbose=F))
kableRules(rules.muscles.joints.pain.relief, "coverage")

rules	support	confidence	coverage	lift	count
{} => {aisle=muscles joints pain relief}	7e-04	0.0007	1.0000	1.0000	378
{reorder_name=Reordered} => {aisle=muscles joints pain relief}	2e-04	0.0004	0.5992	0.5739	130
{reorder_name=Not reordered} => {aisle=muscles joints pain relief}	4e-04	0.0011	0.4008	1.6371	248
{order_id=[2.28e+06,3.42e+06]} => {aisle=muscles joints pain relief}	2e-04	0.0007	0.3333	0.9841	124
{order_id=[1.13e+06,2.28e+06)} => {aisle=muscles joints pain relief}	2e-04	0.0007	0.3333	1.0794	136
{order_id=[1,1.13e+06)} => {aisle=muscles joints pain relief}	2e-04	0.0006	0.3333	0.9365	118

As we can see, senior hours are placed in the sixth position, but the coverage is still very low. For the other metrics, the label indicating these particular hours does not appear at the top of the resulting table.

Question 3

How does the Friday mood affect shopping?

rules.friday<-apriori(data=trans,
                      parameter=list(supp=0.0001, conf=0.000001),
                      appearance=list(default="lhs", rhs= "dow_name=Friday"),
                      control=list(verbose=F))
kableRules(rules.friday, "confidence")

rules	support	confidence	coverage	lift	count
{department=alcohol,time_of_day=Morning} => {dow_name=Friday}	1e-04	0.2336	5e-04	2.0969	64
{order_id=[1.13e+06,2.28e+06),aisle=beers coolers} => {dow_name=Friday}	1e-04	0.2156	5e-04	1.9357	58
{order_id=[1.13e+06,2.28e+06),aisle=beers coolers,department=alcohol} => {dow_name=Friday}	1e-04	0.2156	5e-04	1.9357	58
{order_id=[2.28e+06,3.42e+06],department=alcohol,reorder_name=Not reordered} => {dow_name=Friday}	1e-04	0.2111	5e-04	1.8949	61
{aisle=beers coolers,reorder_name=Not reordered} => {dow_name=Friday}	1e-04	0.2098	5e-04	1.8838	64
{aisle=beers coolers,department=alcohol,reorder_name=Not reordered} => {dow_name=Friday}	1e-04	0.2098	5e-04	1.8838	64

Alcohol is definitely placed very high: some sort of it occurs in all of the top results. Once customers add alcohol to the cart, they do not reorder it. Interestingly, people usually buy alcohol in the morning. This might simply be because most customers shop on Friday morning in order to have more free time during the weekend proper.

trans_for_ar %>%
  group_by(dow_name, time_of_day) %>%
  count() %>%
  ungroup() %>%
  group_by(dow_name) %>%
  mutate(perc = n/sum(n))

The results show that most Friday purchases are made in the afternoon. It is therefore interesting that people buy alcohol in the morning rather than in the afternoon. It is also interesting that people do not choose any salty food in addition to the alcoholic drinks.

rules.lunch.meat<-apriori(data=trans,
                          parameter=list(supp=0.0001, conf=0.000001),
                          appearance=list(default="lhs", rhs= "aisle=lunch meat"),
                          control=list(verbose=F))
kableRules(rules.lunch.meat, "confidence")

rules	support	confidence	coverage	lift	count
{product_name=Sliced Soppressata Salame} => {aisle=lunch meat}	1e-04	1	1e-04	82.017	57
{product_name=Rosemary Ham} => {aisle=lunch meat}	1e-04	1	1e-04	82.017	60
{product_name=Hard Salami} => {aisle=lunch meat}	1e-04	1	1e-04	82.017	60
{product_name=Deli Fresh Honey Ham, 97% Fat Free, Gluten Free} => {aisle=lunch meat}	1e-04	1	1e-04	82.017	65
{product_name=Organic Roast Beef} => {aisle=lunch meat}	1e-04	1	1e-04	82.017	66
{product_name=Uncured Diced Pancetta} => {aisle=lunch meat}	1e-04	1	1e-04	82.017	68

Question 4

Investigate weekdays in terms of some association metrics.

rules <- apriori(trans, parameter=list(supp=0.001, conf=0.001))

Apriori

Parameter specification:

confidence	minval	smax	arem	aval	originalSupport	maxtime	support	minlen	maxlen	target	ext
0.001	0.1	1	none	FALSE	TRUE	5	0.001	1	10	rules	TRUE

Algorithmic control:

filter	tree	heap	memopt	load	sort	verbose
0.1	TRUE	TRUE	FALSE	TRUE	2	TRUE

Absolute minimum support count: 565

set item appearances …[0 item(s)] done [0.00s].
set transactions …[132889 item(s), 565097 transaction(s)] done [5.66s].
sorting and recoding items … [236 item(s)] done [0.07s].
creating transaction tree … done [0.97s].
checking subsets of size 1 2 3 4 5 6 done [0.05s].
writing … [33763 rule(s)] done [0.13s].
creating S4 object … done [0.17s].

rules.dow.tod <- subset(rules, subset = lhs %pin% "dow_name=" & rhs %pin% "time_of_day=")
kableRules(rules.dow.tod, "confidence")

rules	support	confidence	coverage	lift	count
{aisle=ice cream ice,dow_name=Sunday} => {time_of_day=Afternoon}	0.0011	0.3826	0.0029	1.1915	619
{aisle=ice cream ice,department=frozen,dow_name=Sunday} => {time_of_day=Afternoon}	0.0011	0.3826	0.0029	1.1915	619
{order_id=[1.13e+06,2.28e+06),department=snacks,dow_name=Sunday} => {time_of_day=Afternoon}	0.0015	0.3704	0.0041	1.1536	849
{aisle=ice cream ice,dow_name=Monday} => {time_of_day=Afternoon}	0.0013	0.3694	0.0035	1.1503	728
{aisle=ice cream ice,department=frozen,dow_name=Monday} => {time_of_day=Afternoon}	0.0013	0.3694	0.0035	1.1503	728
{aisle=yogurt,dow_name=Monday,reorder_name=Not reordered} => {time_of_day=Afternoon}	0.0011	0.3675	0.0030	1.1446	627

In the first row of the rules sorted by confidence, 38.26% of the transactions that contain ice cream ice and were placed on Sunday were made in the Afternoon. With a lift of 1.1915, this is also the strongest association among the rules shown.

rules <- apriori(trans, parameter=list(supp=0.001, conf=0.001))

Apriori

Parameter specification:

confidence	minval	smax	arem	aval	originalSupport	maxtime	support	minlen	maxlen	target	ext
0.001	0.1	1	none	FALSE	TRUE	5	0.001	1	10	rules	TRUE

Algorithmic control:

filter	tree	heap	memopt	load	sort	verbose
0.1	TRUE	TRUE	FALSE	TRUE	2	TRUE

Absolute minimum support count: 565

set item appearances …[0 item(s)] done [0.00s].
set transactions …[132889 item(s), 565097 transaction(s)] done [5.07s].
sorting and recoding items … [236 item(s)] done [0.07s].
creating transaction tree … done [1.10s].
checking subsets of size 1 2 3 4 5 6 done [0.05s].
writing … [33763 rule(s)] done [0.13s].
creating S4 object … done [0.17s].

rules.tod.dow <- subset(rules, subset = lhs %pin% "time_of_day=" & rhs %pin% "dow_name=")
kableRules(rules.tod.dow, "count")

rules	support	confidence	coverage	lift	count
{time_of_day=Afternoon} => {dow_name=Monday}	0.0780	0.2429	0.3211	1.0360	44082
{time_of_day=Afternoon} => {dow_name=Sunday}	0.0493	0.1537	0.3211	1.0227	27883
{time_of_day=Afternoon} => {dow_name=Tuesday}	0.0481	0.1497	0.3211	1.0062	27155
{reorder_name=Reordered,time_of_day=Afternoon} => {dow_name=Monday}	0.0467	0.2475	0.1889	1.0552	26411
{time_of_day=Noon} => {dow_name=Monday}	0.0407	0.2500	0.1630	1.0659	23026
{time_of_day=Senior hours} => {dow_name=Monday}	0.0396	0.2449	0.1618	1.0445	22401

The largest number of transactions in the data set falls on Monday afternoon.

Visualization techniques

Hierarchical clustering

trans.sel <- trans_for_viz[, itemFrequency(trans_for_viz) > 0.05] # keep items with relative frequency above 0.05
d.jac.i <- dissimilarity(trans.sel, which = "items") # Jaccard as default
plot(hclust(d.jac.i, method = "ward.D2"), main = "Dendrogram for items")
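To illustrate what the Jaccard dissimilarity between items measures, the sketch below computes it by hand on a toy incidence matrix and clusters the result with Ward's method. It uses base R only; the matrix `m`, its item names, and the helper `jaccard()` are made up for illustration and stand in for the arules `dissimilarity()` call.

```r
# Toy binary incidence matrix: rows are transactions, columns are items
# (names and values are made up for illustration)
m <- matrix(
  c(1, 1, 0, 0,
    1, 1, 0, 1,
    1, 0, 1, 0,
    0, 0, 1, 1,
    1, 1, 0, 0),
  ncol = 4, byrow = TRUE,
  dimnames = list(NULL, c("fruits", "vegetables", "snacks", "beverages"))
)

# Jaccard dissimilarity between two items:
# 1 - (transactions with both items) / (transactions with at least one of them)
jaccard <- function(a, b) 1 - sum(a & b) / sum(a | b)

items <- colnames(m)
d <- sapply(items, function(i) sapply(items, function(j) jaccard(m[, i], m[, j])))

# Ward clustering on the item-by-item dissimilarities, cut into two groups
hc <- hclust(as.dist(d), method = "ward.D2")
groups <- cutree(hc, k = 2)

print(round(d, 2))
print(groups)
```

In this toy example fruits and vegetables share three of the four transactions in which either appears, so their dissimilarity is 0.25 and they are merged first.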

If we keep the three biggest clusters, the hclust dendrogram can be interpreted as follows:

the first group of transactions is formed out of fresh vegetables and fresh fruits which do belong to produce department,
the second group contains transactions made between 12 and 18,
last but not least we have third cluster that can be best described as mostly snacks and beverages which are bought either at night or between 10 and 12.

itemFrequencyPlot(trans_for_viz, topN=10, type="absolute", main="Item Frequency")

In terms of item frequency, the plot shows that most of the records were added in the afternoon. Customers also seem to focus on keeping in shape, since many transactions are made in the produce department and others in dairy eggs.

rules_for_viz <- apriori(trans_for_viz, parameter = list(support = 0.001))

Apriori

Parameter specification:

confidence	minval	smax	arem	aval	originalSupport	maxtime	support	minlen	maxlen	target	ext
0.8	0.1	1	none	FALSE	TRUE	5	0.001	1	10	rules	TRUE

Algorithmic control:

filter	tree	heap	memopt	load	sort	verbose
0.1	TRUE	TRUE	FALSE	TRUE	2	TRUE

Absolute minimum support count: 565

set item appearances …[0 item(s)] done [0.00s].
set transactions …[31726 item(s), 565097 transaction(s)] done [2.28s].
sorting and recoding items … [218 item(s)] done [0.02s].
creating transaction tree … done [0.54s].
checking subsets of size 1 2 3 done [0.12s].
writing … [489 rule(s)] done [0.00s].
creating S4 object … done [0.11s].

plot(head(rules_for_viz, 30, by = "lift"), method = "paracoord", reorder =TRUE)

The figure above shows the rules as arrows, where the width of an arrow is linked to the support and the color intensity encodes the confidence of the given rule. The values of each dimension are connected with each other by a line. The y-axis is formed by the nominal values (items) and the x-axis presents the position in the rule.

Conclusion

Association rules are a very useful technique for mining patterns in large data sets, thanks to the efficient algorithms available for them. What is more, they can save a lot of time in finding user preferences and what drives their choices. Most of the useful functionality is implemented either as association rule mining functions or as graphs. For the latter, arulesViz (Hahsler 2017) is widely used; it implements many new visualization techniques for exploring association rules. The main feature of arules (Hahsler, Grün, and Hornik 2005) is an efficient implementation of association rule mining based on sparse matrices.

Bibliography

Hahsler, Michael. 2017. “ArulesViz: Interactive Visualization of Association Rules with R.” R Journal 9 (December): 163–75. https://doi.org/10.32614/RJ-2017-047.

Hahsler, Michael, Bettina Grün, and Kurt Hornik. 2005. “Arules - a Computational Environment for Mining Association Rules and Frequent Item Sets.” Journal of Statistical Software, Articles 14 (15): 1–25. https://doi.org/10.18637/jss.v014.i15.

“Instacart Market Basket Analysis.” 2021. https://www.kaggle.com/c/instacart-market-basket-analysis. February 5, 2021.

Rules that can be found inside the shopping cart. How to quickly measure transactions relationships?

Mateusz Baryła

06.02.2021
