Network Analysis for Association Rules: Online Retail

Introduction

Motivation

E-commerce and online business transactions grow every year. According to bigcommerce, over 5 billion items were shipped worldwide in 2017, with more than 12 million varieties of products sold on Amazon alone. Many retailers now focus on selling their products online: more than 50% of all sales come from third-party sellers, who also sell their products on platforms other than Amazon. Soon, more people will buy things online instead of going to conventional brick-and-mortar stores.

A great online shopping experience can further boost sales. One of the most promising ways to enhance that experience is a recommender system, where the website or application recommends what people may buy based on their current shopping list or past purchases. Another way to attract customers is to create themed categories: promoted collections of items that are likely to be bought together, such as a kitchen set or a Christmas decoration set. Both can be implemented with a data mining method called association rules. Here we will learn how to apply association rules to online retail transactions and visualize them as a network graph.

Objectives

  • Understand Association Rules
  • Understand Network Analysis
  • Propose Business Plan for Online Retail Sales

Association Rule

Concepts

Association rule mining is a method for finding rules that describe the association or co-occurrence between two or more items/objects. For example, people who buy bread are likely to buy milk as well, or a father who just had a baby may buy diapers alongside a beverage for himself, such as beer. In convenience stores, association rules help the store manager adjust product placement to optimize sales: knowing which products should be placed next to each other increases the likelihood that people will buy them in combination. In an online store, or any website in general, association rules can serve as the basis for a recommender system, such as suggesting which movies you are likely to watch if you are currently watching Star Wars, or recommending songs on Spotify.

To measure the strength of each rule and decide which rules are worth considering, we can use 3 metrics: support, confidence, and lift.

Support

Support is the ratio between the number of transactions that contain both item a and item b and the number of all transactions. The higher the support, the more frequently the combination of items appears in the dataset.

\[ \text{Support Rule}(a \Rightarrow b) = \frac{\text{Number of transactions containing } a \text{ and } b}{\text{Number of all transactions}} \]

Confidence

Confidence is the probability that a transaction contains item b, given that it contains item a.

\[ \text{Confidence Rule}(a \Rightarrow b) = \frac{\text{Number of transactions containing } a \text{ and } b}{\text{Number of transactions containing } a} \]

Lift

Lift measures how much the presence of item a increases the confidence that people will buy item b, compared with how often item b is bought overall. A lift above 1 means that a and b appear together more often than they would if they were independent, while a lift of 1 means there is no association.

\[ \text{Lift Rule}(a \Rightarrow b) = \frac{\text{Confidence Rule}(a \Rightarrow b)}{\text{Support}(b)} \]

Where

\[ \text{Support}(b) = \frac{\text{Number of transactions containing } b}{\text{Number of all transactions}} \]
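As a small worked example, the three metrics can be computed by hand in base R. The baskets below are made up for illustration and are not taken from the retail data; the helper name support() is likewise only an assumption of this sketch.

```r
# Toy baskets (invented for illustration, not from the retail data)
baskets <- list(
  c("bread", "milk"),
  c("bread", "milk", "eggs"),
  c("bread", "butter"),
  c("milk", "eggs"),
  c("bread", "milk", "butter")
)

# Share of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

a <- "bread"
b <- "milk"

support_ab    <- support(c(a, b), baskets)           # 3 of 5 baskets
confidence_ab <- support_ab / support(a, baskets)    # 3 of the 4 bread baskets
lift_ab       <- confidence_ab / support(b, baskets) # relative to milk's own frequency

print(c(support = support_ab, confidence = confidence_ab, lift = lift_ab))
```

In this toy data the lift is slightly below 1 (0.9375): milk is already in 4 of 5 baskets, so knowing that a basket contains bread does not make milk any more likely.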

Association Rule with arules package

Import Data

Here is the original data, which I acquired from Kaggle (https://www.kaggle.com/jihyeseo/online-retail-data-set-from-uci-ml-repo). I will transform the data into a format suitable for association rules.

The data records transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts.

First, we preprocess the data.

The function below is used to transform the data into market basket format (transactions). This process may take a while. If you want to get the result directly, here is the link .

We can directly transform the data into transactions using as(df_prep, Class = "transactions"). Here we inspect the first 4 transactions with the LIST() function. Each list element is a single transaction. However, with this method the column name is pasted onto the item name. This distorts the model, because the same product in a different basket position (for example item_1=... versus item_2=...) is treated as a different item. It would be better to remove the column name and keep only the item name.

$`1`
[1] "item_1=white hanging heart t-light holder" 
[2] "item_2=white metal lantern"                
[3] "item_3=cream cupid hearts coat hanger"     
[4] "item_4=knitted union flag hot water bottle"
[5] "item_5=red woolly hottie white heart."     
[6] "item_6=set 7 babushka nesting boxes"       
[7] "item_7=glass star frosted t-light holder"  

$`2`
[1] "item_1=hand warmer union jack"    "item_2=hand warmer red polka dot"

$`3`
 [1] "item_1=assorted colour bird ornament"     
 [2] "item_2=poppys playhouse bedroom"          
 [3] "item_3=poppys playhouse kitchen"          
 [4] "item_4=feltcraft princess charlotte doll" 
 [5] "item_5=ivory knitted mug cosy"            
 [6] "item_6=box of 6 assorted colour teaspoons"
 [7] "item_7=box of vintage jigsaw blocks"      
 [8] "item_8=box of vintage alphabet blocks"    
 [9] "item_9=home building block word"          
[10] "item_10=love building block word"         
[11] "item_11=recipe box with metal heart"      
[12] "item_12=doormat new england"              

$`4`
[1] "item_1=jam making set with jars"      
[2] "item_2=red coat rack paris fashion"   
[3] "item_3=yellow coat rack paris fashion"
[4] "item_4=blue coat rack paris fashion"  

So instead, we will save the data in .csv format and use the read.transactions() function from the arules package to read the data as a transactions object.
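A minimal sketch of this step, assuming the baskets are already available as a list of character vectors. The three toy baskets and the temporary file are placeholders for illustration; the real file and preprocessing are not shown here.

```r
library(arules)

# Toy baskets standing in for the real invoices (one vector per transaction)
baskets <- list(
  c("white metal lantern", "cream cupid hearts coat hanger"),
  c("hand warmer union jack", "hand warmer red polka dot"),
  c("white metal lantern", "hand warmer union jack")
)

# Write one transaction per line, items separated by commas
basket_file <- tempfile(fileext = ".csv")
writeLines(vapply(baskets, paste, character(1), collapse = ","), basket_file)

# format = "basket": every line is one transaction, and the item names
# are read as they are, without any column-name prefix
trans <- read.transactions(basket_file, format = "basket", sep = ",")

summary(trans)     # number of transactions and distinct items
LIST(trans)[1:2]   # first two transactions as plain character vectors
```

Item names that themselves contain a comma would need quoting in the file, so it is worth checking the product descriptions before relying on this format.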

[[1]]
[1] "cream cupid hearts coat hanger"      "glass star frosted t-light holder"  
[3] "knitted union flag hot water bottle" "red woolly hottie white heart."     
[5] "set 7 babushka nesting boxes"        "white hanging heart t-light holder" 
[7] "white metal lantern"                

[[2]]
[1] "hand warmer red polka dot" "hand warmer union jack"   

[[3]]
 [1] "assorted colour bird ornament"      "box of 6 assorted colour teaspoons"
 [3] "box of vintage alphabet blocks"     "box of vintage jigsaw blocks"      
 [5] "doormat new england"                "feltcraft princess charlotte doll" 
 [7] "home building block word"           "ivory knitted mug cosy"            
 [9] "love building block word"           "poppys playhouse bedroom"          
[11] "poppys playhouse kitchen"           "recipe box with metal heart"       

[[4]]
[1] "blue coat rack paris fashion"   "jam making set with jars"      
[3] "red coat rack paris fashion"    "yellow coat rack paris fashion"

The data consists of 25,900 transactions. The first transaction has 7 items, while the second has only 2. Now let’s create association rules to see which products are good to sell together or to recommend.

Create Rules

Let’s create the association rules. We keep only rules with a minimum support of 0.01 and a minimum confidence of 0.7.

Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
        0.7    0.1    1 none FALSE            TRUE       5    0.01      1
 maxlen target   ext
     10  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 259 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[4193 item(s), 25900 transaction(s)] done [0.41s].
sorting and recoding items ... [590 item(s)] done [0.02s].
creating transaction tree ... done [0.02s].
checking subsets of size 1 2 3 4 done [0.08s].
writing ... [82 rule(s)] done [0.00s].
creating S4 object  ... done [0.01s].
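A log like the one above comes from a call to apriori(). The sketch below uses the same thresholds on a handful of invented baskets, so the number of rules will differ from the 82 reported for the retail data. Note that the "Absolute minimum support count: 259" in the log is simply 0.01 × 25900.

```r
library(arules)

# Toy baskets (illustrative only); the real analysis uses the retail transactions
baskets <- list(
  c("tea plate pink", "tea plate green", "tea plate roses"),
  c("tea plate pink", "tea plate green"),
  c("tea plate pink", "tea plate green", "cakestand"),
  c("cakestand", "jumbo bag"),
  c("jumbo bag")
)
trans <- as(baskets, "transactions")

# Same thresholds as in the text: support >= 0.01 and confidence >= 0.7
rules <- apriori(
  trans,
  parameter = list(support = 0.01, confidence = 0.7, target = "rules")
)

length(rules)  # number of rules that pass both thresholds
```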

Explore the Rules

Let’s look at the rules that have the highest confidence.

    lhs                                     rhs                                  support confidence     lift count
[1] {pink regency teacup and saucer,                                                                              
     regency cakestand 3 tier,                                                                                    
     roses regency teacup and saucer}    => {green regency teacup and saucer} 0.01169884  0.8991098 22.03117   303
[2] {regency tea plate pink}             => {regency tea plate green}         0.01088803  0.8980892 60.26039   282
[3] {set/20 red retrospot paper napkins,                                                                          
     set/6 red spotty paper cups}        => {set/6 red spotty paper plates}   0.01023166  0.8952703 43.99905   265
[4] {pink regency teacup and saucer,                                                                              
     roses regency teacup and saucer}    => {green regency teacup and saucer} 0.02119691  0.8941368 21.90931   549
[5] {green regency teacup and saucer,                                                                             
     pink regency teacup and saucer,                                                                              
     regency cakestand 3 tier}           => {roses regency teacup and saucer} 0.01169884  0.8757225 20.25108   303
[6] {jumbo bag pink polkadot,                                                                                     
     jumbo shopper vintage red paisley,                                                                           
     jumbo storage bag suki}             => {jumbo bag red retrospot}         0.01038610  0.8677419 10.52671   269

Many of the rules have 3 items in the antecedent and a high confidence (more than 0.85). For example, according to the first rule, people who buy pink regency teacup and saucer, regency cakestand 3 tier, and roses regency teacup and saucer are likely to also buy green regency teacup and saucer. There are 303 transactions that contain this combination of items; dividing by the total number of transactions (25900) gives the support of 0.01169884. The confidence of 0.8991098 means that about 90% of the transactions containing the first 3 items also contain green regency teacup and saucer. The lift is well above 1, suggesting that the presence of the first 3 items greatly increases the likelihood of buying the fourth. According to the second rule, people who buy regency tea plate pink are also likely to buy regency tea plate green. The second rule has a higher lift than the first.

Let’s look at the rules that have the highest lift.
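The numbers reported for the first rule can be checked against the metric definitions with plain arithmetic. The sketch below only rearranges the formulas; the variable names are chosen for illustration, and all inputs are copied from the table above.

```r
n_transactions  <- 25900      # total number of transactions
count_rule      <- 303        # transactions containing all four items of rule 1
confidence_rule <- 0.8991098  # reported confidence of rule 1
lift_rule       <- 22.03117   # reported lift of rule 1

# Support: share of all transactions that contain the full item set
support_rule <- count_rule / n_transactions
support_rule

# From confidence = count(lhs and rhs) / count(lhs)
count_lhs <- count_rule / confidence_rule
count_lhs     # about 337 transactions contain the three antecedent items

# From lift = confidence / support(rhs)
support_rhs <- confidence_rule / lift_rule
support_rhs   # about 0.041
```

Read this way, green regency teacup and saucer appears in roughly 4% of all transactions but in roughly 90% of the transactions that contain the three antecedent items, which is where the lift of about 22 comes from.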

     lhs                                     rhs                                support confidence     lift count
[1]  {regency tea plate pink}             => {regency tea plate green}       0.01088803  0.8980892 60.26039   282
[2]  {regency tea plate green}            => {regency tea plate pink}        0.01088803  0.7305699 60.26039   282
[3]  {regency tea plate pink}             => {regency tea plate roses}       0.01050193  0.8662420 49.09337   272
[4]  {poppys playhouse livingroom}        => {poppys playhouse bedroom}      0.01011583  0.7939394 48.27002   262
[5]  {regency tea plate green}            => {regency tea plate roses}       0.01243243  0.8341969 47.27724   322
[6]  {regency tea plate roses}            => {regency tea plate green}       0.01243243  0.7045952 47.27724   322
[7]  {poppys playhouse livingroom}        => {poppys playhouse kitchen}      0.01015444  0.7969697 46.91253   263
[8]  {set/20 red retrospot paper napkins,                                                                        
      set/6 red spotty paper cups}        => {set/6 red spotty paper plates} 0.01023166  0.8952703 43.99905   265
[9]  {poppys playhouse kitchen}           => {poppys playhouse bedroom}      0.01212355  0.7136364 43.38775   314
[10] {poppys playhouse bedroom}           => {poppys playhouse kitchen}      0.01212355  0.7370892 43.38775   314

A lift greater than 1 means that the purchase of an item such as regency tea plate pink increases the likelihood that people will also buy regency tea plate green. How can we benefit from these rules? Since these are online retail transactions, we can use the rules as a recommender system, for example by showing regency tea plate roses at the bottom of the screen when people put regency tea plate pink in their cart. We may also create a bundle or collector’s edition of the regency tea plate varieties to increase sales, since half of the top 10 rules involve regency tea plate items.
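One way such a recommendation lookup could be sketched with arules is to sort the rules by lift and keep those whose antecedent contains the item in the cart. The baskets and the shortened item names below are invented for illustration; this is not the code used to produce the tables above.

```r
library(arules)

# Toy baskets (illustrative only, not the retail data)
baskets <- list(
  c("tea plate pink", "tea plate green", "tea plate roses"),
  c("tea plate pink", "tea plate green"),
  c("tea plate pink", "tea plate green", "cakestand"),
  c("cakestand", "jumbo bag"),
  c("jumbo bag")
)
trans <- as(baskets, "transactions")
rules <- apriori(trans,
                 parameter = list(support = 0.01, confidence = 0.7),
                 control = list(verbose = FALSE))

# Rules ordered from highest to lowest lift
by_lift <- sort(rules, by = "lift")
inspect(head(by_lift, n = 10))

# Rules that fire when "tea plate pink" is in the cart
pink_rules <- subset(by_lift, lhs %in% "tea plate pink")

# Candidate items to show as recommendations
unique(labels(rhs(pink_rules)))
```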

Pink Regency Tea Set

Analyzing rules as text or tables can be cumbersome. With many rules, it’s hard to see the relationships between rules or between the items they involve. Some rules may form a network or chain with other rules. Such insights only emerge when we visualize the rules in a meaningful way. Therefore, we will do network analysis for the association rules.

Network Analysis

Concepts

A network (or graph) is a way to represent and visualize relationships between discrete objects. There are several types of network analysis: electrical network analysis, social network analysis, biological network analysis, link analysis, etc. Generally, a network consists of two parts: nodes and edges.

  • Nodes

A node is a single point or circle that represents a discrete component or object. The size or color of a node may encode a numerical value that signifies its importance, such as the frequency of item a in the dataset.

  • Edges

An edge is a line that represents the relationship between two nodes/objects. If the relationship has a direction, such as the presence of item a increasing the likelihood of the presence of item b, the line becomes an arrow.

There are many ways to draw a graph/network in R. Here, we will illustrate how to draw a network for association rules with 2 packages: arulesViz and visNetwork. I also include extra material on the networkD3 package if you want to explore more.

Create Network with arulesViz

Let’s create a network for the 50 rules that have the highest lift.

    lhs                              rhs                           support confidence     lift count
[1] {regency tea plate pink}      => {regency tea plate green}  0.01088803  0.8980892 60.26039   282
[2] {regency tea plate green}     => {regency tea plate pink}   0.01088803  0.7305699 60.26039   282
[3] {regency tea plate pink}      => {regency tea plate roses}  0.01050193  0.8662420 49.09337   272
[4] {poppys playhouse livingroom} => {poppys playhouse bedroom} 0.01011583  0.7939394 48.27002   262
[5] {regency tea plate green}     => {regency tea plate roses}  0.01243243  0.8341969 47.27724   322
[6] {regency tea plate roses}     => {regency tea plate green}  0.01243243  0.7045952 47.27724   322

Drawing a network with the arulesViz package is straightforward: we simply use the plot() function. The default plot is drawn using the igraph package.

We can make the plot interactive by changing the engine parameter to htmlwidget.

Each circle represents a rule: the arrows pointing toward it come from the antecedent items, and the arrow pointing out of it leads to the consequent. For example, in Rule 1 the antecedent is regency tea plate pink and the consequent is regency tea plate green, meaning that people who buy regency tea plate pink are likely to buy regency tea plate green as well. Detailed information appears in the tooltip when we hover over the nodes or text.

Using arulesViz is an easy way to quickly create a graph/network. However, the result is hard to customize. Next, we will customize and build the network from scratch with visNetwork.
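The two plots described above could be produced along these lines. This is a sketch on invented baskets rather than the original code; engine = "igraph" is passed explicitly because the default engine may differ between arulesViz versions.

```r
library(arules)
library(arulesViz)

# Toy baskets (illustrative only, not the retail data)
baskets <- list(
  c("tea plate pink", "tea plate green", "tea plate roses"),
  c("tea plate pink", "tea plate green"),
  c("tea plate pink", "tea plate green", "cakestand"),
  c("cakestand", "jumbo bag"),
  c("jumbo bag")
)
trans <- as(baskets, "transactions")
rules <- apriori(trans,
                 parameter = list(support = 0.01, confidence = 0.7),
                 control = list(verbose = FALSE))

# Keep at most the 50 rules with the highest lift
top_rules <- head(sort(rules, by = "lift"), n = 50)

# Static graph, requesting the igraph engine explicitly
plot(top_rules, method = "graph", engine = "igraph")

# Interactive graph with hover tooltips
widget <- plot(top_rules, method = "graph", engine = "htmlwidget")
widget
```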

Create Network with visNetwork

Before we create a new network, we must first retrieve the rule information. We can use the DATAFRAME() function to retrieve the rules as a data frame.

Let’s tidy the data by separating the LHS into several columns, since some rules have more than 1 item in the antecedent.

Let’s create the nodes. We will link the items directly without showing the rules.

Let’s create the edges and remove redundant directions.

We have prepared the nodes and the edges for our network. Next, we use the visNetwork() function to build the network.
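The steps above can be sketched as follows. To keep the example self-contained, it starts from a small hand-written data frame shaped like the output of DATAFRAME() (columns LHS and RHS holding strings such as "{a,b}"), with three rules copied from the tables earlier. The helper strip_braces() and the way mirrored edges are dropped are assumptions of this sketch, not necessarily what the original code did, and splitting on commas assumes that no item name contains a comma.

```r
# Hand-copied rows shaped like the output of arules::DATAFRAME(rules)
df_rules <- data.frame(
  LHS = c("{regency tea plate pink}",
          "{regency tea plate green}",
          "{set/20 red retrospot paper napkins,set/6 red spotty paper cups}"),
  RHS = c("{regency tea plate green}",
          "{regency tea plate pink}",
          "{set/6 red spotty paper plates}"),
  lift = c(60.26039, 60.26039, 43.99905),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop the curly braces around an item set
strip_braces <- function(x) gsub("[{}]", "", x)

# One edge per (antecedent item, consequent) pair
lhs_items <- strsplit(strip_braces(df_rules$LHS), ",", fixed = TRUE)
edges <- data.frame(
  from = unlist(lhs_items),
  to   = rep(strip_braces(df_rules$RHS), lengths(lhs_items)),
  stringsAsFactors = FALSE
)

# Remove the mirrored direction: a -> b and b -> a collapse into one edge
pair_key <- ifelse(edges$from < edges$to,
                   paste(edges$from, edges$to, sep = "|"),
                   paste(edges$to, edges$from, sep = "|"))
edges <- edges[!duplicated(pair_key), ]

# Every item that appears in an edge becomes a node
nodes <- data.frame(id = unique(c(edges$from, edges$to)),
                    stringsAsFactors = FALSE)
nodes$label <- nodes$id

# Draw the network only if visNetwork is installed
if (requireNamespace("visNetwork", quietly = TRUE)) {
  graph <- visNetwork::visNetwork(nodes, edges)
  graph <- visNetwork::visEdges(graph, arrows = "to")
  graph
}
```

Linking items directly keeps the picture small, at the cost of hiding which antecedent items belong to the same rule.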

Analyze the Network

Let’s get some insights from the network we’ve extracted from the 50 rules with the highest lift.

  • There are several rules that don’t form a network, such as rule 14 {small marshmallows pink bowl => small dolly mix design orange bowl} and rule 17 {toilet metal sign => bathroom metal sign}.
  • Some products from the regency tea set form a network of at least 12 interconnected rules. The item regency cakestand 3 tier is the antecedent of many rules, while the items roses regency teacup and saucer and green regency teacup and saucer are both antecedents and consequents of many rules. The network consists of only 4 items, suggesting that these items influence each other’s probability of being bought. The key item is regency cakestand 3 tier, since it never appears as a consequent. If we can make people buy this product, chances are they will also buy the rest. We may want to create a special package consisting of those 4 items, since people are likely to buy all of them together.
  • Another big network, consisting of 17 rules, involves products related to charlotte bag. charlotte bag pink polkadot and strawberry charlotte bag are the antecedents of many rules, while red retrospot charlotte bag is the consequent of many rules as well as an antecedent. The products are mostly bags, so we may want to create a collector’s edition, or offer a discount or another promo to people who already have one of them.

Conclusion

Association rules can help store managers or online merchants increase their sales. Understanding the association or co-occurrence between items helps us plan which promos or recommendations to offer people based on their purchases. Network analysis helps us find more insights than looking at the rules individually.

Extra

Create Network with networkD3

networkD3 is another package that you can use to visualize your data as a network. Personally, I think its graphs look more beautiful than those from visNetwork.

Create Edges and Nodes for d3

We will use the same nodes and edges from the previous process.

Create Network

Use the forceNetwork() function to build the network.
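A minimal sketch of this step, assuming nodes and edges shaped like the ones built for visNetwork. networkD3 refers to nodes by zero-based row index, so the item names have to be converted first. The three edges are copied from the lift table earlier; the column names and the scaling of the link width are choices made for this illustration.

```r
# Nodes and edges in the same spirit as the visNetwork section
nodes <- data.frame(
  name  = c("regency tea plate pink",
            "regency tea plate green",
            "regency tea plate roses"),
  group = "regency tea plate",
  stringsAsFactors = FALSE
)
edges <- data.frame(
  from = c("regency tea plate pink", "regency tea plate pink",
           "regency tea plate green"),
  to   = c("regency tea plate green", "regency tea plate roses",
           "regency tea plate roses"),
  lift = c(60.26039, 49.09337, 47.27724),
  stringsAsFactors = FALSE
)

# networkD3 expects zero-based indices into the rows of `nodes`
links <- data.frame(
  source = match(edges$from, nodes$name) - 1L,
  target = match(edges$to, nodes$name) - 1L,
  value  = edges$lift / 10   # scaled down so the links are not too thick
)

# Draw the network only if networkD3 is installed
if (requireNamespace("networkD3", quietly = TRUE)) {
  networkD3::forceNetwork(
    Links = links, Nodes = nodes,
    Source = "source", Target = "target", Value = "value",
    NodeID = "name", Group = "group",
    opacity = 0.9, zoom = TRUE, arrows = TRUE
  )
}
```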

Sankey Diagram

Alternatively, we can visualize the flow from the antecedent to the consequent using a Sankey diagram.
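One possible sketch with sankeyNetwork() from networkD3 is shown below. Antecedents and consequents are kept as two separate sets of nodes so that the flow always runs from left to right, even when an item appears on both sides of different rules. That layout decision, and using support as the link width, are assumptions of this sketch; the three rules are copied from the lift table earlier.

```r
# Three single-antecedent rules copied from the lift table
rules_df <- data.frame(
  lhs = c("regency tea plate pink", "regency tea plate pink",
          "regency tea plate green"),
  rhs = c("regency tea plate green", "regency tea plate roses",
          "regency tea plate roses"),
  support = c(0.01088803, 0.01050193, 0.01243243),
  stringsAsFactors = FALSE
)

# Separate node sets for antecedents (left) and consequents (right)
lhs_nodes <- unique(rules_df$lhs)
rhs_nodes <- unique(rules_df$rhs)
nodes <- data.frame(name = c(lhs_nodes, rhs_nodes), stringsAsFactors = FALSE)

# Zero-based indices; consequent indices start after the antecedent block
links <- data.frame(
  source = match(rules_df$lhs, lhs_nodes) - 1L,
  target = length(lhs_nodes) + match(rules_df$rhs, rhs_nodes) - 1L,
  value  = rules_df$support
)

# Draw the diagram only if networkD3 is installed
if (requireNamespace("networkD3", quietly = TRUE)) {
  networkD3::sankeyNetwork(
    Links = links, Nodes = nodes,
    Source = "source", Target = "target", Value = "value",
    NodeID = "name", fontSize = 12
  )
}
```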

2019-12-06