Photo by Leonie Wise on Unsplash

Section 1 - Overview

In this week of dataviz, I will first use market basket analysis to find the association rules between different products, then use network data visualisation to illustrate the associations between different products.

Section 1.1 - Market Basket Analysis

Market basket analysis is a type of data mining method to uncover the underlying relationships between different items. It is often being used by the business to understand what items tend to be purchased together.

In market basket analysis, three of the important concepts are:
- Support: Measure how frequent the item or item set appears in the transactions
- Confidence: Measure the likelihood that customers would buy the products shown in the rules, given that they have the products listed on the left hand side in their basket
- Lift: Co-occurence of products on the left hand side and right hand side exceeds the likelihood of products on left hand side and right hand side are independent

Section 1.2 - Network Visualisation

Network visualisation technique is one of visualisation technique to illustrate how the relationships of different data points. This would be very handy in helping us drawing insights from the association rules.

 

Section 2 - Data Exploraration & Preparation

In this analysis, I will use the dataset from Instacart. The dataset can be found under this website. The data dictionary can found under this link.

After unzipping the dataset, there are 6 datasets being provided. Following are the datasets provided by the website:
order_products__prior
order_products__train
orders
products
aisles
departments

The details of each data can be found under the data dictionary.

 

Step 1 - Merge the data file

The transaction data is split into two files (i.e. ’order_products__prior’ and ’order_products__train’). ’order_products__train’ only contains a subset of customers with their last transaction with InstaCart.

Hence, JMP Pro is used to merge the two files to obtain all the transactions made


Fig 1: Combine the dataset

 

Step 2 - Summarise the product count by product id

As there are 33 millions of transactions in the daatset provided (Fig 2), there will be performance issue with my laptop. I will scale down the dataset by focusing the top 100 products with highest sales. This would account for about 20% of the transaction data (or about 7.8 millions transactions from the dataset).


Fig 2: Total Transaction Count

To do so, JMP Pro is used to summarize the data by product id level to count the number of the relevant products are sold.


Fig 3: Summary function

 

Step 3 - Sort the count descendingly

A new summary table will pop up once the transaction is being summarised by product id level. Next, right click on the product count column and choose ‘Sort’ > ‘Descending’ as shown under Fig 3 below.


Fig 4: Sorting descendingly on New Summary Table

 

Step 4 - Include the product name into the dataset

As the product id is not inituitive for users to know what products they are, I will perform a left join with ‘product’ table to show the product names under each product id. To do so, I will use join function under Rows in ribbons shown on top of JMP.


Fig 5: Left join ‘product’ onto summary table

 

Step 5 - ‘Clean up’ product names

Before subsetting the transactions, I noted that the product descriptions are pretty messy. For example, Fig 6 shows two different product naming for lemons despite that both of the products are lemons. This posed a challenge when performing market basket analysis as the algorithm will treat these two as different products. The rules might not be meaningful if this is not resolved.


Fig 6: Issue of Different Naming for Similar Products

To fix this, I have used the Recode function in JMP Pro to clean up the product names. Fig 9 shows the sample comparison of the product names before and after cleaning.


Fig 7: Recode function in JMP


Fig 8: A summary table to allow users to recode the data


Fig 9: Comparison of Product Name Before and After Recode

 

Step 6 - Subset out the top 100 products with higest sales

Once we have cleaned up the product names, we will proceed and subset the transactions of the top 100 products. This is done through Data View function (as shown under Fig 10) after selecting the top 100 products in JMP Pro.


Fig 10: Subset out top 100 products

 

Step 7 - Perform Inner Join to select out the relevant transactions

Then, I will perform an inner join to drop those transactions are not from the top 100 products. Similar step as Fig 5 is followed, except I will not tick the box to include non-matched.


Fig 11: Inner join function

Once the subset transaction is done, save the data file as csv. Then, park the dataset into the data folder of the R project.

 

Step 8 - Investigate the data types of each variable in the dataset

Next, we will use RStudio to check the data type and transform the data into transaction format.

To do so, following list of required packages in creating the visualization are being loaded RStudio environment.

The code chuck below will first search whether these pacakges are being installed in RStudio environment. If the relevant packages are not installed in R, they will be installed. Once the packages are available, the code will load them into the RStudio evironment.

package = c("dplyr", "tidyverse", "arules", "arulesViz", "igraph","visNetwork")

for (p in package){
  if(!require (p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

Next, I will read the data file into RStudio.

# Read the data
data <- read_csv("data/Order with Products_Top 100 products(grp).csv")
## Parsed with column specification:
## cols(
##   order_id = col_double(),
##   product_id = col_double(),
##   add_to_cart_order = col_double(),
##   reordered = col_double(),
##   user_id = col_double(),
##   eval_set = col_character(),
##   order_number = col_double(),
##   order_dow = col_double(),
##   order_hour_of_day = col_double(),
##   days_since_prior_order = col_double(),
##   aisle_id = col_double(),
##   department_id = col_double(),
##   product_name = col_character()
## )

Next, I will inspect the data type of the different variables in the dataset.

# Check the dataset again after converting Order_ID into factor
str(data)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 7795471 obs. of  13 variables:
##  $ order_id              : num  2539329 2398795 2398795 473747 2254736 ...
##  $ product_id            : num  196 196 13176 196 196 ...
##  $ add_to_cart_order     : num  1 1 4 1 1 1 7 8 1 1 ...
##  $ reordered             : num  0 1 0 1 1 1 0 1 1 1 ...
##  $ user_id               : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ eval_set              : chr  "prior" "prior" "prior" "prior" ...
##  $ order_number          : num  1 2 2 3 4 5 5 5 6 7 ...
##  $ order_dow             : num  2 3 3 3 4 4 4 4 2 1 ...
##  $ order_hour_of_day     : num  8 7 7 12 7 15 15 15 7 9 ...
##  $ days_since_prior_order: num  NA 15 15 21 29 28 28 28 19 20 ...
##  $ aisle_id              : num  77 77 24 77 77 77 24 24 77 77 ...
##  $ department_id         : num  7 7 4 7 7 7 4 4 7 7 ...
##  $ product_name          : chr  "Soda" "Soda" "Banana" "Soda" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   order_id = col_double(),
##   ..   product_id = col_double(),
##   ..   add_to_cart_order = col_double(),
##   ..   reordered = col_double(),
##   ..   user_id = col_double(),
##   ..   eval_set = col_character(),
##   ..   order_number = col_double(),
##   ..   order_dow = col_double(),
##   ..   order_hour_of_day = col_double(),
##   ..   days_since_prior_order = col_double(),
##   ..   aisle_id = col_double(),
##   ..   department_id = col_double(),
##   ..   product_name = col_character()
##   .. )

Note that the ‘order_id’ is being read as a numeric figure. This is inaccurate as the ‘order_id’ represents the transaction id and they should be changed to nominal data. To do so, factor function is used to recode ‘order_id’ to nominal data.

# Clean up the data
data$order_id <- factor(data$order_id)

I will check the data again to ensure the ‘order_id’ is being changed accordingly.

# Check the dataset again after converting Order_ID into factor
str(data)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 7795471 obs. of  13 variables:
##  $ order_id              : Factor w/ 2444982 levels "1","2","3","5",..: 1814923 1714480 1714480 338811 1611524 308638 308638 308638 2406767 393436 ...
##  $ product_id            : num  196 196 13176 196 196 ...
##  $ add_to_cart_order     : num  1 1 4 1 1 1 7 8 1 1 ...
##  $ reordered             : num  0 1 0 1 1 1 0 1 1 1 ...
##  $ user_id               : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ eval_set              : chr  "prior" "prior" "prior" "prior" ...
##  $ order_number          : num  1 2 2 3 4 5 5 5 6 7 ...
##  $ order_dow             : num  2 3 3 3 4 4 4 4 2 1 ...
##  $ order_hour_of_day     : num  8 7 7 12 7 15 15 15 7 9 ...
##  $ days_since_prior_order: num  NA 15 15 21 29 28 28 28 19 20 ...
##  $ aisle_id              : num  77 77 24 77 77 77 24 24 77 77 ...
##  $ department_id         : num  7 7 4 7 7 7 4 4 7 7 ...
##  $ product_name          : chr  "Soda" "Soda" "Banana" "Soda" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   order_id = col_double(),
##   ..   product_id = col_double(),
##   ..   add_to_cart_order = col_double(),
##   ..   reordered = col_double(),
##   ..   user_id = col_double(),
##   ..   eval_set = col_character(),
##   ..   order_number = col_double(),
##   ..   order_dow = col_double(),
##   ..   order_hour_of_day = col_double(),
##   ..   days_since_prior_order = col_double(),
##   ..   aisle_id = col_double(),
##   ..   department_id = col_double(),
##   ..   product_name = col_character()
##   .. )

 

Step 9 - Transform the data into transaction format

Once we have finished cleaning the data, I will transform the data into transaction format so that we could find association rules between different items later.

data_list <- split(data$product_name, data$order_id)
data_trx <- as(data_list, "transactions")
## Warning in asMethod(object): removing duplicated items in transactions

Note that there is a warning message from R studio that the dulicated items in transactions are being removed. No issue on this as by taking order id 139227 as an example, we noted this customer has purchased different kinds of apples within the same basket (as shown under Fig 12). In this analysis, we will consider the different types of apples as ‘apples’. This is because if the catogerisation of the products are too granular, the derived rules from market basket analysis may not be meaningful.


Fig 12: Repeated transactions

Once the transaction data is created, we will check the first 5 records to ensure the data is being successfully transformed into the transaction format.

inspect(head(data_trx))
##     items          transactionID
## [1] {Avocado,                   
##      Banana,                    
##      Cheese,                    
##      Cucumber}                1 
## [2] {Carrots,                   
##      Kale}                    2 
## [3] {Almond Milk,               
##      Ginger,                    
##      Spinach}                 3 
## [4] {Avocado,                   
##      Banana,                    
##      Milk,                      
##      Raspberries}             5 
## [5] {Olive Oil}               9 
## [6] {Avocado,                   
##      Banana,                    
##      Beans,                     
##      Cilantro,                  
##      Half & Half,               
##      Onions,                    
##      Strawberries}            10

 

Section 3 - Market Basket Analysis

Once the transaction data is successfully created, we will use apriori function from arules package to find the association rules.

Over here, I will only output those association rules that fulfill following criteria:
- Minimum support of 1%
- Minimum confidence of 10%
- Minumum of 2 items in the basket

# find the association rules
rules <- apriori(data_trx, 
                 parameter = list(
                   supp = 0.01,
                   conf = 0.1,
                   minlen = 2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.1    0.1    1 none FALSE            TRUE       5    0.01      2
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 24449 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[50 item(s), 2444982 transaction(s)] done [1.47s].
## sorting and recoding items ... [50 item(s)] done [0.17s].
## creating transaction tree ... done [2.68s].
## checking subsets of size 1 2 3 4 done [0.40s].
## writing ... [347 rule(s)] done [0.00s].
## creating S4 object  ... done [0.71s].

As some of the smaller rules are actually a subset of the larger rules, I will remove them from the rules.

# remove the reducdant rules
subsetRules <- which(colSums(is.subset(rules, rules)) > 1)
rules_1 <- rules[-subsetRules]

Finally, the rules are presented in an interactive data table format so that the user could draw insights from the rules. Users can choose how they would like to filter the tables.

# convert the rules into interactive data table
inspectDT(rules_1)

 

Section 4 - Network Data Visualisation

Section 4.1 - Prototype

After cleaning and exploring the data, below is the prototype of the final visualisation:

The key highlight of the final visualisation is that I would incroporate the interactivity into the visualisation to allow the users to draw insights from the visualisation.

Next, I will plot a static graph and interactive graph to contast the advantages of using interactive graphs.

Section 4.2 - Static Graph

Plot is used to plot out the association rules of different items.

# plot the association rules
static_graph <- plot(rules_1, method = "graph")


Fig 13: Static Graph on Association Rules

Note that the graph is in image format, hence we are unable to interact with the graph. For exmaple, if I am interested in finding out the items are associated with avocado, it will be difficult to find out from the graph above.

 

Section 4.3 - Interactive Graph

So, to allow for interactivity, vizNetwork will be used to created an interactive graph.

# network visualisation
int_graph <- as_data_frame(static_graph, what = "both")

nodes <- data.frame(id = int_graph$vertices$name,
                    value = int_graph$vertices$lift,
                    color.background = int_graph$vertices$support,
                    int_graph$vertices)

edges <- data.frame(int_graph$edges)

visNetwork(nodes, edges) %>%
  visEdges(arrows ="to") %>%
  visNodes(color = list(background = int_graph$vertices$support)) %>%
  visOptions( highlightNearest = list(enabled =TRUE, degree = 2), nodesIdSelection = TRUE)


Fig 14: Interactive Graph on Association Rules

Note that I have also added the function to allow the users to highlight the association rules by product names. This will allow users to quickly identify the rules associated with the interested products.Alternatively, the user can click on the interested product name shown in the graph in order to highlight the relevant nodes.

Meanwhile, I find the nodes in Fig 14 are too bouncy. It is difficult for me to move the nodes around since the nodes would either bounce back or the whole graph will move together. To resolve this, I have added another vizIgraphLayout layer to the plotting function. Also, to ensure the reproducibility, I have set the random seed to be a fix number. This would allow my code to generate back the same graph.

nodes <- data.frame(id = int_graph$vertices$name,
                    value = int_graph$vertices$lift,
                    color.background = int_graph$vertices$support,
                    int_graph$vertices)

edges <- data.frame(int_graph$edges)

network_1 <- visNetwork(nodes, edges) %>%
              visEdges(arrows ="to") %>%
              visIgraphLayout(randomSeed = 12345) %>%
              visOptions( highlightNearest = list(enabled =TRUE, degree = 2), nodesIdSelection = TRUE)

network_1


Fig 15: Interactive Graph on Association Rules with Default Option in VisIgraphLayout

Besides, I have also attempted to plot the association rules in different layout types. This will allow us to compare what is the best way to illustrate the association rules of the data.

nodes <- data.frame(id = int_graph$vertices$name,
                    value = int_graph$vertices$lift,
                    color.background = int_graph$vertices$support,
                    int_graph$vertices)

edges <- data.frame(int_graph$edges)

network_2 <- visNetwork(nodes, edges) %>%
              visEdges(arrows ="to") %>%  
              visIgraphLayout(layout = "layout_in_circle",
                              randomSeed = 123) %>%
              visOptions( highlightNearest = list(enabled =TRUE, degree = 2), nodesIdSelection = TRUE)

network_2


Fig 16: Interactive Graph on Association Rules with Plotting Rules in Circle Format

nodes <- data.frame(id = int_graph$vertices$name,
                    value = int_graph$vertices$lift,
                    color.background = int_graph$vertices$support,
                    int_graph$vertices)

edges <- data.frame(int_graph$edges)

network_3 <- visNetwork(nodes, edges) %>%
              visEdges(arrows ="to") %>%  
              visIgraphLayout(layout = "layout_with_sugiyama") %>%
              visOptions( highlightNearest = list(enabled =TRUE, degree = 2), nodesIdSelection = TRUE)

network_3


Fig 17: Interactive Graph on Association Rules with Plotting Rules in Sugiyama Format

 

Section 5 - Final Visualisation

After the comparison of the different graph types, I have chosen this graph type. This is because I felt this graph is easier to read and easier for the users to make comparison.


Fig 18: Final Visualisation

Insight 1 - Bananas are associated with a lot of items, but the associations are realtively weaker if we compare to other items (eg. onions & avocado)

Under Fig 18, we saw that bananas are associated with lots of items sold by the company, however it doesn’t convey other info, such as how strong is the association rule, how frequent the combination of items appear in our basket and so on.

To better visualise how the different association compare in terms of support and lift, I have plotly function to plot the association rules with support as the x-axis and confidence as the y-axis as shown under Fig 19.

plotly_arules(rules_1, 
              method = "scatterplot", 
              shading = "confidence",
              measure = c("support", "lift"))
## Warning: 'plotly_arules' is deprecated.
## Use 'plot' instead.
## See help("Deprecated")


Fig 19: Assocation Rules

From the graph, we see that despite that banana has many association rules, but the lift of the rules are relatively lower. Unlike onions and avocado, they have relatively higher lift and lower support. So, it seems like there is some tradeoff over here. If we would like to have products with high association rules, we will need to “sacrifice” some support.

Ideally we would like to find the association rules that have high lift and high support. So, we should discuss with the business units to determine the requirements, eg. what is minimum support and lift required, which measurement is more important and so on.

Insight 2 - Compared to fruits, vegetables generally tend to be less associated with other vegetables

From the illustrated network, we noted that the vegetables tend to be less likely to be associated to other vegetables. In other words, if the basket already contain one type of veegtables, its less likely to find another different types of vegetables in the basket.

For example, Fig 20 shows the association rules for Broccoli and Grapes. From the graph, we can see that grapes are also associated with other fruits, whereas broccoli is only associated with the non-vegetables items.

One possible explanation is customers view the different types of fruits as complement to one another. In other words, the customers are likely to buy other types of fruits if they buy one types of fruits. In contrast, the customers view different types of vegetables as substitute. They are less likely to buy other types of vegetables once they have vegetables in their basket.

This is crucial especially when the company comes up with up-selling and cross-selling strategies. By taking the two products in Fig 20 as an example, if the company gives discount on grapes, this will not just increase the sales of grapes, we would probably observe an increase in the sales for products that are closely associated with grapes. If the company decided to give promotion on broccoli instead, we might see an increase in broccoli sales and a decrease in sales for other vegetables.

*Fig 20: Association Rules for Broccoli and Grapes**Fig 20: Association Rules for Broccoli and Grapes*

Fig 20: Association Rules for Broccoli and Grapes

Insight 3 - Despite that vegetables are less associated in general, some vegetables are strongly associated with one another

In general, I observed that vegetables are less associated to one another. However, there are some vegetables are strongly with one another.

By sorting the lift descendingly from the interactive data table as shown under Fig 21, I noted onions are strongly associated with zucchini and kale. This could suggest onions are complement goods for the mentioned items.


Fig 21: Top 5 Association Rules (Sort Descending by Lift)

 

Section 6 - Reflection on Static Graph vs Interactive Graph

Reflection 1 - Interactivity allows us to highlight the data in the graph

By including the interactivity, I could highlight the item I am interested and draw insights from the graphs plotted.

 

Reflection 2 - Interactivity also allows me to re-arrange the data points in the graph

Sometimes, the data points shown in the graph might be overlapped as shown in the graph below. This makes it impossible to identify the association rules for some items. For example, I am unable to read the association rules for avocado as the rules are overlapping on top one another.


Fig 22: Static Graph on Association Rules

By allowing interactivity, I am able to adjust the data points in the graph.


Fig 23: Final Graph on Association Rules

 

Reflection 3 - Dropdown box under interactivity function allows the users to quickly locate the item of interest

In the final visualisation, I have included the dropdown box. Hence, instead of searching through the graph, this would allows the users to quickly locate the items they are interested in.


Fig 24: Dropdown Box in Final Visualisation