Introduction

Over the last year, retail optimization has gained a lot of attention in the data science community. The COVID-19 pandemic brought restrictions that changed customer behavior and forced the closure of many physical stores. “The Next Normal” requires adaptation to new requirements, and the winners are those retailers that prioritize online retail. For instance, in the United States, e-commerce availability and hygiene concerns caused 17 percent of consumers to leave their primary store. One American retailer, Instacart, has already addressed the issue of optimizing e-commerce and released a prediction competition on Kaggle with several data sets (“Instacart Market Basket Analysis” 2021). Instacart is an online American retailer that uses advanced analytics algorithms to provide the best experience for its customers.

Dataset preprocessing

Instacart Market Basket loading

First, all the data need to be loaded into the R environment.

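library(dplyr)      # pipes, joins, sample_frac, tibble
library(knitr)      # kable
library(arules)     # transactions, apriori
library(arulesViz)  # rule plots

# Forward slashes work on every platform, including Windows.
aisles <- read.csv("data/aisles.csv", sep=",")
departments <- read.csv("data/departments.csv", sep=",")
order_products_train <- read.csv("data/order_products__train.csv", sep=",")
orders <- read.csv("data/orders.csv", sep=",")
products <- read.csv("data/products.csv", sep=",")
random_names <- read.csv("data/random_names.csv", sep=",")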

Next, to reduce the number of rows, a random sample of the order products is taken. We keep 50% of the rows, which still results in more than half a million transactions.

set.seed(0402)
trans <-
  order_products_train %>%
  sample_frac(size = 0.5) %>%
  inner_join(products, by = "product_id", keep = FALSE) %>%
  inner_join(aisles, by = "aisle_id", keep = FALSE) %>%
  inner_join(orders, by = "order_id", keep = FALSE) %>%
  inner_join(departments, by ="department_id", keep = FALSE)

Preprocessing

The data set is not easy to handle in its current format: it contains many ids that are hard to interpret. Several transformations are therefore performed to turn it into readable transactional data.

kable_head <- function(df){
  df %>% kable() %>% head()
}
day_of_week <-
  tibble(day_id = 0:6,
         dow_name = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
         )

day_of_week %>% kable_head
day_id  dow_name
0       Monday
1       Tuesday
2       Wednesday
3       Thursday
4       Friday
5       Saturday
6       Sunday

reordered_df <-
  tibble(reordered_id = 0:1,
         reorder_name = c("Not reordered", "Reordered")
         )

reordered_df %>% kable_head
reordered_id  reorder_name
0             Not reordered
1             Reordered

order_hour_of_day <-
  tibble(
       hour_id = 0:23,
       time_of_day = c(rep("Night",6), rep("Morning", 4), rep("Senior hours", 2),
                    rep("Noon", 2), rep("Afternoon", 4), rep("Early night",4), rep("Night", 2))
       )

order_hour_of_day %>% kable_head
hour_id  time_of_day
0        Night
1        Night
2        Night
3        Night
4        Night
5        Night
6        Morning
7        Morning
8        Morning
9        Morning
10       Senior hours
11       Senior hours
12       Noon
13       Noon
14       Afternoon
15       Afternoon
16       Afternoon
17       Afternoon
18       Early night
19       Early night
20       Early night
21       Early night
22       Night
23       Night

Three mapping tables have been created above. The first two are self-explanatory; for the hour of the day, six time periods have been assigned for further analysis.
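The hour mapping amounts to a lookup from an hour (0 to 23) to a label. A minimal base R sketch of the same lookup, done without a join, is shown below; example_hours is a hypothetical input used only for illustration.

```r
# Same labels as in the order_hour_of_day mapping table (hours 0 to 23).
time_of_day <- c(rep("Night", 6), rep("Morning", 4), rep("Senior hours", 2),
                 rep("Noon", 2), rep("Afternoon", 4), rep("Early night", 4),
                 rep("Night", 2))

# Hypothetical order hours to label; R vectors are 1-indexed, hence the + 1.
example_hours <- c(0, 10, 13, 17, 23)
labels <- time_of_day[example_hours + 1]

print(data.frame(hour = example_hours, time_of_day = labels))
print(table(time_of_day))   # number of hours assigned to each label
```

The join used in the report gives the same labels; the vector lookup is merely a compact way to check that all 24 hours are covered and that six distinct periods exist.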

trans_for_ar <-
  trans %>%
  inner_join(day_of_week, by = c("order_dow" = "day_id")) %>%
  inner_join(reordered_df, by = c("reordered" = "reordered_id")) %>%
  inner_join(order_hour_of_day, by = c("order_hour_of_day" = "hour_id")) %>%
  inner_join(random_names, by = c("user_id"= "user_id")) %>%
  select(-aisle_id, -department_id, -eval_set,
         -product_id, -order_dow, - reordered,
         -order_hour_of_day, -days_since_prior_order,
         -order_number, -add_to_cart_order, -user_id) %>%
  arrange(order_id, user_name)

The transactions are combined into one big data frame, and the id columns other than order_id are removed. The eval_set column is also redundant for the analysis. Because the chosen data set is very large and only a sample of it is used, the column days_since_prior_order is of little use: not every user in the sample has order number 1 associated. This column is therefore omitted, as is order_number.

trans <- as(trans_for_ar, "transactions")
trans_for_viz <- as(trans_for_ar %>% select(product_name, aisle, department), "transactions")
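After the coercion, each row becomes one transaction whose items are column=value pairs, for example aisle=fresh fruits. The base R sketch below imitates that labeling for a tiny, made-up data frame; toy_df and row_to_items are hypothetical names, and the snippet does not reproduce the internals of the arules coercion.

```r
# Made-up rows standing in for trans_for_ar (values are illustrative only).
toy_df <- data.frame(
  aisle        = c("fresh fruits", "coffee"),
  department   = c("produce", "beverages"),
  reorder_name = c("Reordered", "Not reordered"),
  stringsAsFactors = TRUE
)

# Turn a one-row data frame into "column=value" item labels.
row_to_items <- function(row) {
  paste0(names(row), "=", vapply(row, as.character, character(1)))
}

toy_items <- lapply(seq_len(nrow(toy_df)), function(i) row_to_items(toy_df[i, ]))
print(toy_items[[1]])
```

Numeric columns such as order_id show up in the mined rules as intervals, which suggests they are binned during the coercion; that step is not imitated here.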

The final transactional dataframe is made of the following attributes:

  • order_id - needed to identify a transaction uniquely
  • product_name - full product name
  • aisle - aisle name that identifies where the product belongs inside the store
  • department - department name; there are 21 departments in the store
  • dow_name - day of the week on which the order was made
  • reorder_name - whether the product has been reordered or not
  • time_of_day - artificial time-of-day label derived from the hour of the order
  • user_name - artificial user name, easier to identify than the original user_id

Association rules

General view

Association rules are often used to analyze databases with a large number of transactions in which each row is coded by binary attributes. The method is unsupervised: the data need not be labeled beforehand and no model has to be trained. The most common applications are market basket analysis, the discovery of interesting patterns in DNA and protein sequences, and the detection of common behavior patterns that precede customers dropping their cell phone operator.

In transactional data analysis, association rule mining and frequent itemset mining play an important role, particularly in retail basket analysis. The technique is especially useful for mining patterns in large databases. The most frequently used statistics are frequent itemsets, maximal frequent itemsets, closed frequent itemsets, and association rules.

Definition:

Let \(I = \{i_1, i_2, ..., i_n \}\) be a set of \(n\) binary attributes called items. Let \(D = \{t_1, t_2, ..., t_m \}\) be a set of transactions called the database. Each transaction in \(D\) has a unique transaction ID and contains a subset of the items in \(I\). A rule is an implication of the form \(X \Rightarrow Y\), where \(X, Y \subseteq I\) and \(X \cap Y = \emptyset\). The sets of items \(X\) and \(Y\) are called the antecedent (LHS) and the consequent (RHS) of the rule.

The support \(supp(X)\) of an itemset \(X\) is defined as the ratio of the number of transactions that contain \(X\) to the total number of transactions in the data set.

The confidence of a rule is defined as \(conf(X \Rightarrow Y) = \frac{supp(X \cup Y)}{supp(X)}\).

Another practical measure, used to address the problem of finding too many rules, is lift, which is defined as \(lift(X \Rightarrow Y)=\frac{supp(X \cup Y)}{supp(X)\,supp(Y)}\). It can be interpreted as the deviation of the support of the whole rule from the support expected under independence given the supports of the LHS and the RHS (Hahsler, Grün, and Hornik 2005).
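As a minimal sketch of the three measures defined above, the following base R snippet computes them on a toy database of five transactions. The item names and the helper supp() are hypothetical and serve illustration only; they are not part of the Instacart data or of the arules API.

```r
# Toy database: each list element is one transaction (hypothetical items).
toy_db <- list(
  c("bread", "milk"),
  c("bread", "butter"),
  c("bread", "milk", "butter"),
  c("milk"),
  c("bread", "milk")
)

# Support of an itemset: share of transactions containing every item in it.
supp <- function(itemset, db) {
  mean(vapply(db, function(tr) all(itemset %in% tr), logical(1)))
}

X <- "bread"
Y <- "milk"

supp_xy   <- supp(c(X, Y), toy_db)                           # supp(X u Y)
conf_xy   <- supp_xy / supp(X, toy_db)                       # conf(X => Y)
lift_xy   <- supp_xy / (supp(X, toy_db) * supp(Y, toy_db))   # lift(X => Y)

print(c(support = supp_xy, confidence = conf_xy, lift = lift_xy))
```

Here the rule has support 0.6, confidence 0.75 and lift 0.9375. A lift below 1 means the two items occur together slightly less often than independence would predict.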

Associations mining

To investigate the nature of the dataset, the read.transactions function has been used. The transformations produce a transaction matrix with 565097 rows and 132889 columns. Not surprisingly, the most frequent items come from the binary feature and from the transaction identifier, order_id. There are no missing values in the database.

Rules found in the dataset

kableRules <- function(rules, sortable){
  rules.sorted<-sort(rules, by=sortable, decreasing=TRUE)
  rules.sorted.df <- as(head(rules.sorted), "data.frame")
  rownames(rules.sorted.df) <- NULL
  rules.sorted.df %>% 
    as_tibble() %>% 
    mutate(across(.cols = c(support, confidence, coverage, lift, count), 
                  ~round(., 4))) %>%
    kable()
}
rules.trans<-apriori(trans, parameter=list(supp=0.1, conf=0.5),
                     control=list(verbose=FALSE))



The rules are mined with two thresholds: a minimal support of 0.1 and a minimal confidence of 0.5.

kableRules(rules.trans, "confidence")
rules support confidence coverage lift count
{aisle=fresh vegetables} => {department=produce} 0.1084 1.0000 0.1084 3.3928 61246
{aisle=fresh fruits} => {department=produce} 0.1085 1.0000 0.1085 3.3928 61332
{department=dairy eggs} => {reorder_name=Reordered} 0.1061 0.6752 0.1572 1.1267 59976
{department=produce} => {reorder_name=Reordered} 0.1961 0.6654 0.2947 1.1104 110824
{dow_name=Monday} => {reorder_name=Reordered} 0.1429 0.6093 0.2345 1.0168 80742
{order_id=[2.28e+06,3.42e+06]} => {reorder_name=Reordered} 0.2008 0.6025 0.3333 1.0054 113483



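All the associations found are fairly obvious: every aisle belongs to exactly one department, so rules such as {aisle=fresh vegetables} => {department=produce} have confidence 1 by construction.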
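The reported measures can be cross-checked with a little arithmetic. The sketch below uses only numbers printed in the table above and the total of 565097 transactions reported for this dataset; the variable names are illustrative.

```r
n_trans <- 565097          # number of transactions reported for the dataset

# Rule {aisle=fresh vegetables} => {department=produce}, values from the table
count_rule <- 61246        # count column
conf_rule  <- 1.0000       # confidence column
lift_rule  <- 3.3928       # lift column

# support = count / number of transactions
support_check <- round(count_rule / n_trans, 4)

# lift = confidence / supp(RHS), so supp(RHS) = confidence / lift
supp_produce <- round(conf_rule / lift_rule, 4)

print(c(support = support_check, supp_produce = supp_produce))
```

Both values agree with the table: 0.1084 is the support of the rule, and 0.2947 equals the coverage of the rule whose left-hand side is department=produce.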

kableRules(rules.trans, "lift")
rules support confidence coverage lift count
{aisle=fresh vegetables} => {department=produce} 0.1084 1.0000 0.1084 3.3928 61246
{aisle=fresh fruits} => {department=produce} 0.1085 1.0000 0.1085 3.3928 61332
{department=dairy eggs} => {reorder_name=Reordered} 0.1061 0.6752 0.1572 1.1267 59976
{department=produce} => {reorder_name=Reordered} 0.1961 0.6654 0.2947 1.1104 110824
{dow_name=Monday} => {reorder_name=Reordered} 0.1429 0.6093 0.2345 1.0168 80742
{order_id=[2.28e+06,3.42e+06]} => {reorder_name=Reordered} 0.2008 0.6025 0.3333 1.0054 113483



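What is definitely interesting is the dependency between reordering and shopping for food: products from the dairy eggs and produce departments are reordered with confidence 0.6752 and 0.6654, respectively, with lift slightly above 1.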

kableRules(rules.trans, "count")
rules support confidence coverage lift count
{} => {reorder_name=Reordered} 0.5992 0.5992 1.0000 1.0000 338631
{order_id=[2.28e+06,3.42e+06]} => {reorder_name=Reordered} 0.2008 0.6025 0.3333 1.0054 113483
{order_id=[1,1.13e+06)} => {reorder_name=Reordered} 0.1993 0.5978 0.3333 0.9976 112602
{order_id=[1.13e+06,2.28e+06)} => {reorder_name=Reordered} 0.1992 0.5975 0.3333 0.9971 112546
{department=produce} => {reorder_name=Reordered} 0.1961 0.6654 0.2947 1.1104 110824
{time_of_day=Afternoon} => {reorder_name=Reordered} 0.1889 0.5882 0.3211 0.9816 106731



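Not surprisingly, order_id and reorder_name rank high. The order_id items are intervals, each covering a third of the transactions (coverage 0.3333), apparently because the numeric order_id column was binned when the data frame was coerced to transactions. The counts suggest that the afternoon is the most common time of day for shopping.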

Questions

Question 1

What is the profile of transactions that are being made during Senior hours?

rules.Senior.hours<-apriori(data=trans,
                            parameter=list(supp=0.001, conf=0.08),
                            appearance=list(default="lhs", rhs= "time_of_day=Senior hours"),
                            control=list(verbose=F))
kableRules(rules.Senior.hours, "confidence")
rules support confidence coverage lift count
{department=beverages,dow_name=Tuesday,reorder_name=Reordered} => {time_of_day=Senior hours} 0.0017 0.2023 0.0084 1.2501 965
{department=snacks,dow_name=Tuesday,reorder_name=Reordered} => {time_of_day=Senior hours} 0.0016 0.2012 0.0078 1.2431 889
{department=beverages,dow_name=Tuesday} => {time_of_day=Senior hours} 0.0024 0.1921 0.0124 1.1873 1351
{department=household,dow_name=Monday} => {time_of_day=Senior hours} 0.0010 0.1902 0.0054 1.1750 583
{aisle=coffee} => {time_of_day=Senior hours} 0.0012 0.1897 0.0061 1.1721 651
{aisle=coffee,department=beverages} => {time_of_day=Senior hours} 0.0012 0.1897 0.0061 1.1721 651



As the documentation states, a rule with an empty left-hand side ({}) means that, no matter what other items have been chosen, the item on the right-hand side will be chosen with a confidence equal to its support.
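For a rule with an empty left-hand side, the antecedent is contained in every transaction, so its support is 1 and the formulas collapse. The sketch below checks this against the {} => {reorder_name=Reordered} row reported earlier (count 338631 out of 565097 transactions); the variable names are illustrative.

```r
n_trans       <- 565097    # transactions in the dataset
count_reorder <- 338631    # count of {} => {reorder_name=Reordered}

supp_lhs  <- 1                           # the empty set is in every transaction
supp_rule <- count_reorder / n_trans     # supp({} u Y) = supp(Y)

conf_rule <- supp_rule / supp_lhs                  # equals supp(Y)
lift_rule <- supp_rule / (supp_lhs * supp_rule)    # always 1

print(round(c(support = supp_rule, confidence = conf_rule, lift = lift_rule), 4))
```

The result, support 0.5992, confidence 0.5992 and lift 1, matches that row, which makes such a rule a useful baseline when judging the confidence of other rules with the same right-hand side.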

Question 2

What is the profile of transactions that are being made during Senior hours?

rules.Senior.hours<-apriori(data=trans,
                            parameter=list(supp=0.001, conf=0.08),
                            appearance=list(default="lhs", rhs= "time_of_day=Senior hours"),
                            control=list(verbose=F))
kableRules(rules.Senior.hours, "lift")
rules support confidence coverage lift count
{department=beverages,dow_name=Tuesday,reorder_name=Reordered} => {time_of_day=Senior hours} 0.0017 0.2023 0.0084 1.2501 965
{department=snacks,dow_name=Tuesday,reorder_name=Reordered} => {time_of_day=Senior hours} 0.0016 0.2012 0.0078 1.2431 889
{department=beverages,dow_name=Tuesday} => {time_of_day=Senior hours} 0.0024 0.1921 0.0124 1.1873 1351
{department=household,dow_name=Monday} => {time_of_day=Senior hours} 0.0010 0.1902 0.0054 1.1750 583
{aisle=coffee} => {time_of_day=Senior hours} 0.0012 0.1897 0.0061 1.1721 651
{aisle=coffee,department=beverages} => {time_of_day=Senior hours} 0.0012 0.1897 0.0061 1.1721 651



Ordering by confidence and by lift gives the same result for Senior hours. Given these metrics, beverages dominate, and people tend to use Senior hours on Tuesdays. Coffee also appears, so some customers seem to need additional caffeine. No aisles such as “eye ear care” or “digestion” show up, although their presence would have suggested that these hours are often used by people with health problems. Let us now check the profile of a health-related aisle, muscles joints pain relief.

rules.muscles.joints.pain.relief<-apriori(data=trans,
                                          parameter=list(supp=0.0001, conf=0.000001),
                                          appearance=list(default="lhs",
                                                          rhs= "aisle=muscles joints pain relief"),
                                          control=list(verbose=F))
kableRules(rules.muscles.joints.pain.relief, "coverage")
rules support confidence coverage lift count
{} => {aisle=muscles joints pain relief} 7e-04 0.0007 1.0000 1.0000 378
{reorder_name=Reordered} => {aisle=muscles joints pain relief} 2e-04 0.0004 0.5992 0.5739 130
{reorder_name=Not reordered} => {aisle=muscles joints pain relief} 4e-04 0.0011 0.4008 1.6371 248
{order_id=[2.28e+06,3.42e+06]} => {aisle=muscles joints pain relief} 2e-04 0.0007 0.3333 0.9841 124
{order_id=[1.13e+06,2.28e+06)} => {aisle=muscles joints pain relief} 2e-04 0.0007 0.3333 1.0794 136
{order_id=[1,1.13e+06)} => {aisle=muscles joints pain relief} 2e-04 0.0006 0.3333 0.9365 118



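Sorted by coverage, the six rules shown contain no item for Senior hours: the list is dominated by reorder_name and the order_id intervals, and support and confidence are very low throughout. For the other metrics, the label that indicates these particular hours does not appear at the top of the resulting table either.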

Question 3

How does the Friday mood affect shopping?

rules.friday<-apriori(data=trans,
                      parameter=list(supp=0.0001, conf=0.000001),
                      appearance=list(default="lhs", rhs= "dow_name=Friday"),
                      control=list(verbose=F))
kableRules(rules.friday, "confidence")
rules support confidence coverage lift count
{department=alcohol,time_of_day=Morning} => {dow_name=Friday} 1e-04 0.2336 5e-04 2.0969 64
{order_id=[1.13e+06,2.28e+06),aisle=beers coolers} => {dow_name=Friday} 1e-04 0.2156 5e-04 1.9357 58
{order_id=[1.13e+06,2.28e+06),aisle=beers coolers,department=alcohol} => {dow_name=Friday} 1e-04 0.2156 5e-04 1.9357 58
{order_id=[2.28e+06,3.42e+06],department=alcohol,reorder_name=Not reordered} => {dow_name=Friday} 1e-04 0.2111 5e-04 1.8949 61
{aisle=beers coolers,reorder_name=Not reordered} => {dow_name=Friday} 1e-04 0.2098 5e-04 1.8838 64
{aisle=beers coolers,department=alcohol,reorder_name=Not reordered} => {dow_name=Friday} 1e-04 0.2098 5e-04 1.8838 64



Alcohol clearly ranks very high: some form of it occurs in all of the top results. Several of these rules also contain reorder_name=Not reordered, which suggests that alcohol added to the cart on Friday is often a first-time purchase. Interestingly, the strongest rule combines alcohol with the morning. This might simply be because most customers do their Friday shopping in the morning in order to have more free time during the weekend proper.

trans_for_ar %>%
  group_by(dow_name, time_of_day) %>%
  count() %>%
  ungroup() %>%
  group_by(dow_name) %>%
  mutate(perc = n/sum(n))



The results show that most Friday purchases are made in the afternoon, so it is interesting that alcohol is associated with the morning rather than the afternoon. It is also notable that people do not choose any salty food in addition to the alcoholic drinks.

rules.lunch.meat<-apriori(data=trans,
                          parameter=list(supp=0.0001, conf=0.000001),
                          appearance=list(default="lhs", rhs= "aisle=lunch meat"),
                          control=list(verbose=F))
kableRules(rules.lunch.meat, "confidence")
rules support confidence coverage lift count
{product_name=Sliced Soppressata Salame} => {aisle=lunch meat} 1e-04 1 1e-04 82.017 57
{product_name=Rosemary Ham} => {aisle=lunch meat} 1e-04 1 1e-04 82.017 60
{product_name=Hard Salami} => {aisle=lunch meat} 1e-04 1 1e-04 82.017 60
{product_name=Deli Fresh Honey Ham, 97% Fat Free, Gluten Free} => {aisle=lunch meat} 1e-04 1 1e-04 82.017 65
{product_name=Organic Roast Beef} => {aisle=lunch meat} 1e-04 1 1e-04 82.017 66
{product_name=Uncured Diced Pancetta} => {aisle=lunch meat} 1e-04 1 1e-04 82.017 68

Question 4

Investigate weekdays in terms of some association metrics.

rules <- apriori(trans, parameter=list(supp=0.001, conf=0.001))

Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen maxlen target  ext
      0.001    0.1    1 none FALSE            TRUE       5   0.001      1     10  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 565

set item appearances …[0 item(s)] done [0.00s].
set transactions …[132889 item(s), 565097 transaction(s)] done [5.66s].
sorting and recoding items … [236 item(s)] done [0.07s].
creating transaction tree … done [0.97s].
checking subsets of size 1 2 3 4 5 6 done [0.05s].
writing … [33763 rule(s)] done [0.13s].
creating S4 object … done [0.17s].
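The log above reports an absolute minimum support count of 565. A short sketch of where that number comes from, using the relative threshold passed to apriori() and the number of transactions in the log; the variable names are illustrative.

```r
n_trans     <- 565097   # transactions reported in the log
min_support <- 0.001    # supp passed to apriori()

# Relative threshold expressed as a number of transactions.
abs_threshold <- min_support * n_trans
print(abs_threshold)

# Dropping the fractional part gives the count shown in the log.
print(floor(abs_threshold))
```

The product is about 565.1, so the reported 565 is consistent with the fractional part being dropped; how arules rounds in other cases is not verified here.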

rules.dow.tod <- subset(rules, subset = lhs %pin% "dow_name=" & rhs %pin% "time_of_day=")
kableRules(rules.dow.tod, "confidence")
rules support confidence coverage lift count
{aisle=ice cream ice,dow_name=Sunday} => {time_of_day=Afternoon} 0.0011 0.3826 0.0029 1.1915 619
{aisle=ice cream ice,department=frozen,dow_name=Sunday} => {time_of_day=Afternoon} 0.0011 0.3826 0.0029 1.1915 619
{order_id=[1.13e+06,2.28e+06),department=snacks,dow_name=Sunday} => {time_of_day=Afternoon} 0.0015 0.3704 0.0041 1.1536 849
{aisle=ice cream ice,dow_name=Monday} => {time_of_day=Afternoon} 0.0013 0.3694 0.0035 1.1503 728
{aisle=ice cream ice,department=frozen,dow_name=Monday} => {time_of_day=Afternoon} 0.0013 0.3694 0.0035 1.1503 728
{aisle=yogurt,dow_name=Monday,reorder_name=Not reordered} => {time_of_day=Afternoon} 0.0011 0.3675 0.0030 1.1446 627



In the first row of the rules sorted by confidence, 38.26% of the ice cream ice purchases made on Sunday took place in the Afternoon. The lift of 1.1915 is also the highest among the rows shown, so this is the strongest association in the table.
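The subset() call used for this question selects rules by partial matches on item labels. The base R sketch below imitates that filter with grepl() on made-up label strings; toy_rules is hypothetical, and the snippet only illustrates the selection logic, not how arules implements %pin%.

```r
# Made-up rule sides written as label strings (illustration only).
toy_rules <- data.frame(
  lhs = c("{aisle=ice cream ice,dow_name=Sunday}",
          "{time_of_day=Afternoon}",
          "{department=snacks,dow_name=Monday}"),
  rhs = c("{time_of_day=Afternoon}",
          "{dow_name=Monday}",
          "{time_of_day=Noon}"),
  stringsAsFactors = FALSE
)

# Keep rules whose LHS mentions a day of week and whose RHS is a time of day.
keep <- grepl("dow_name=", toy_rules$lhs, fixed = TRUE) &
        grepl("time_of_day=", toy_rules$rhs, fixed = TRUE)

print(toy_rules[keep, ])
```

Swapping the two patterns gives the reverse selection, with the time of day on the left-hand side and the day of the week on the right-hand side.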

rules <- apriori(trans, parameter=list(supp=0.001, conf=0.001))

Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen maxlen target  ext
      0.001    0.1    1 none FALSE            TRUE       5   0.001      1     10  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 565

set item appearances …[0 item(s)] done [0.00s].
set transactions …[132889 item(s), 565097 transaction(s)] done [5.07s].
sorting and recoding items … [236 item(s)] done [0.07s].
creating transaction tree … done [1.10s].
checking subsets of size 1 2 3 4 5 6 done [0.05s].
writing … [33763 rule(s)] done [0.13s].
creating S4 object … done [0.17s].

rules.tod.dow <- subset(rules, subset = lhs %pin% "time_of_day=" & rhs %pin% "dow_name=")
kableRules(rules.tod.dow, "count")
rules support confidence coverage lift count
{time_of_day=Afternoon} => {dow_name=Monday} 0.0780 0.2429 0.3211 1.0360 44082
{time_of_day=Afternoon} => {dow_name=Sunday} 0.0493 0.1537 0.3211 1.0227 27883
{time_of_day=Afternoon} => {dow_name=Tuesday} 0.0481 0.1497 0.3211 1.0062 27155
{reorder_name=Reordered,time_of_day=Afternoon} => {dow_name=Monday} 0.0467 0.2475 0.1889 1.0552 26411
{time_of_day=Noon} => {dow_name=Monday} 0.0407 0.2500 0.1630 1.0659 23026
{time_of_day=Senior hours} => {dow_name=Monday} 0.0396 0.2449 0.1618 1.0445 22401

The most common combination in the dataset is Monday Afternoon, with 44082 transactions (support 0.0780).

Visualization techniques

Hierarchical clustering

trans.sel <- trans_for_viz[, itemFrequency(trans_for_viz) > 0.05] # keep items with relative frequency above 0.05
d.jac.i <- dissimilarity(trans.sel, which = "items") # Jaccard is the default
plot(hclust(d.jac.i, method = "ward.D2"), main = "Dendrogram for items")
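To make the clustering step concrete, here is a minimal base R sketch of Jaccard dissimilarity between items followed by Ward clustering. The incidence matrix toy_inc and its item names are made up; dist(..., method = "binary") is used as a stand-in for the Jaccard dissimilarity computed by arules.

```r
# Toy incidence matrix: rows are transactions, columns are items (made up).
toy_inc <- matrix(
  c(1, 1, 0, 0,
    1, 1, 0, 0,
    1, 0, 1, 0,
    0, 0, 1, 1,
    0, 0, 1, 1),
  ncol = 4, byrow = TRUE,
  dimnames = list(NULL, c("fruits", "vegetables", "snacks", "beverages"))
)

# Items are compared, so transpose: dist() works on rows.
# method = "binary" is the Jaccard distance for 0/1 data.
d_items <- dist(t(toy_inc), method = "binary")

# Same linkage as in the report.
hc <- hclust(d_items, method = "ward.D2")

# Cut the tree into two groups of items.
groups <- cutree(hc, k = 2)
print(round(as.matrix(d_items), 2))
print(groups)
```

Items that tend to appear in the same transactions (fruits with vegetables, snacks with beverages) have a small dissimilarity and end up in the same group, which is the logic behind reading the dendrogram.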

If we keep the three biggest clusters, the hclust graph can be interpreted as follows:

  • the first group of transactions is formed of fresh vegetables and fresh fruits, which belong to the produce department,
  • the second group contains transactions made between 12 and 18,
  • last but not least, the third cluster can best be described as mostly snacks and beverages, which are bought either at night or between 10 and 12.

itemFrequencyPlot(trans_for_viz, topN=10, type="absolute", main="Item Frequency")

In terms of item frequency, most of the records have been added in the afternoon. Customers also seem to focus on keeping in shape, since many transactions come from the produce department; dairy eggs follows.

rules_for_viz <- apriori(trans_for_viz, parameter = list(support = 0.001))

Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen maxlen target  ext
        0.8    0.1    1 none FALSE            TRUE       5   0.001      1     10  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 565

set item appearances …[0 item(s)] done [0.00s].
set transactions …[31726 item(s), 565097 transaction(s)] done [2.28s].
sorting and recoding items … [218 item(s)] done [0.02s].
creating transaction tree … done [0.54s].
checking subsets of size 1 2 3 done [0.12s].
writing … [489 rule(s)] done [0.00s].
creating S4 object … done [0.11s].


plot(head(rules_for_viz, 30, by = "lift"), method = "paracoord", reorder =TRUE)

The figure above shows the rules as arrows: the width of an arrow is linked to support, and its color carries the confidence of the given rule. The values of each dimension are connected by a line. The y-axis lists the nominal item values and the x-axis gives the position in the rule.

Conclusion

Association rules are a very useful technique for mining patterns in large datasets, thanks to efficient mining algorithms. Moreover, they can save a lot of time in finding user preferences and what drives their choices. Most of the useful functionality is implemented either as rule mining functions or as graphs. For the latter, arulesViz (Hahsler 2017) is widely used; it implements many new visualization techniques for exploring association rules. The main strength of arules (Hahsler, Grün, and Hornik 2005) for association rules is its efficient implementation based on sparse matrices.

Bibliography

Hahsler, Michael. 2017. “ArulesViz: Interactive Visualization of Association Rules with R.” R Journal 9 (December): 163–75. https://doi.org/10.32614/RJ-2017-047.

Hahsler, Michael, Bettina Grün, and Kurt Hornik. 2005. “Arules - a Computational Environment for Mining Association Rules and Frequent Item Sets.” Journal of Statistical Software, Articles 14 (15): 1–25. https://doi.org/10.18637/jss.v014.i15.

“Instacart Market Basket Analysis.” 2021. https://www.kaggle.com/c/instacart-market-basket-analysis. Accessed February 5, 2021.