Association Rules

Introduction

In this project, I will analyse association rules among subcategories of supermarket products. The main aim is to give insights to marketing department for shelf design and promotions.

The data has been acquired from the Kaggle website (V. Chowdhury 2022). The data consists of several columns about customers, branches, and product. However for Market Basket Analysis we need only two columns from the data:

First one is Sub-Category. I use Sub-Category column instead of product because there are extensive range of products. This much detail will not give us a clear insights. In further steps, one can filter out for several sub-categories and make a market basket analysis with product names.
Second is Customer.ID. As the transaction of each subcategory has been shown in separate rows, we need Customer.ID to detect the transaction pathes of customers.

data <- read.csv2("Sample - Superstore.csv", sep = ",")
str(data)

## 'data.frame':    9994 obs. of  21 variables:
##  $ Row.ID       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Order.ID     : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
##  $ Order.Date   : chr  "11/8/2016" "11/8/2016" "6/12/2016" "10/11/2015" ...
##  $ Ship.Date    : chr  "11/11/2016" "11/11/2016" "6/16/2016" "10/18/2015" ...
##  $ Ship.Mode    : chr  "Second Class" "Second Class" "Second Class" "Standard Class" ...
##  $ Customer.ID  : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
##  $ Customer.Name: chr  "Claire Gute" "Claire Gute" "Darrin Van Huff" "Sean O'Donnell" ...
##  $ Segment      : chr  "Consumer" "Consumer" "Corporate" "Consumer" ...
##  $ Country      : chr  "United States" "United States" "United States" "United States" ...
##  $ City         : chr  "Henderson" "Henderson" "Los Angeles" "Fort Lauderdale" ...
##  $ State        : chr  "Kentucky" "Kentucky" "California" "Florida" ...
##  $ Postal.Code  : int  42420 42420 90036 33311 33311 90032 90032 90032 90032 90032 ...
##  $ Region       : chr  "South" "South" "West" "South" ...
##  $ Product.ID   : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
##  $ Category     : chr  "Furniture" "Furniture" "Office Supplies" "Furniture" ...
##  $ Sub.Category : chr  "Bookcases" "Chairs" "Labels" "Tables" ...
##  $ Product.Name : chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
##  $ Sales        : chr  "261.96" "731.94" "14.62" "957.5775" ...
##  $ Quantity     : int  2 3 2 5 2 7 4 6 3 5 ...
##  $ Discount     : chr  "0" "0" "0" "0.45" ...
##  $ Profit       : chr  "41.9136" "219.582" "6.8714" "-383.031" ...

data <- subset(data, select = c(Customer.ID, Sub.Category))
str(data)

## 'data.frame':    9994 obs. of  2 variables:
##  $ Customer.ID : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
##  $ Sub.Category: chr  "Bookcases" "Chairs" "Labels" "Tables" ...

summary(data)

##  Customer.ID        Sub.Category      
##  Length:9994        Length:9994       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character

Data Preparation

As Apriori algorithm use special type of data (transaction), we need to convert our data.

trans <- as(split(data[,"Sub.Category"], data[,"Customer.ID"]), "transactions")

## Warning in asMethod(object): removing duplicated items in transactions

summary(trans)

## transactions as itemMatrix in sparse format with
##  793 rows (elements/itemsets/transactions) and
##  17 columns (items) and a density of 0.4452192 
## 
## most frequent items:
##     Binders       Paper Furnishings     Storage      Phones     (Other) 
##         650         611         528         514         511        3188 
## 
## element (itemset/transaction) length distribution:
## sizes
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15 
##   7  15  27  46  71 103 119 114  99  94  53  22  20   1   2 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   6.000   8.000   7.569   9.000  15.000 
## 
## includes extended item information - examples:
##        labels
## 1 Accessories
## 2  Appliances
## 3         Art
## 
## includes extended transaction information - examples:
##   transactionID
## 1      AA-10315
## 2      AA-10375
## 3      AA-10480

inspect(trans[1:10])

##      items          transactionID
## [1]  {Accessories,               
##       Appliances,                
##       Binders,                   
##       Fasteners,                 
##       Furnishings,               
##       Paper,                     
##       Phones,                    
##       Storage,                   
##       Supplies}          AA-10315
## [2]  {Accessories,               
##       Art,                       
##       Binders,                   
##       Furnishings,               
##       Paper,                     
##       Phones,                    
##       Storage}           AA-10375
## [3]  {Accessories,               
##       Art,                       
##       Furnishings,               
##       Paper,                     
##       Phones,                    
##       Storage,                   
##       Tables}            AA-10480
## [4]  {Art,                       
##       Binders,                   
##       Bookcases,                 
##       Chairs,                    
##       Envelopes,                 
##       Furnishings,               
##       Paper,                     
##       Phones,                    
##       Storage}           AA-10645
## [5]  {Art,                       
##       Bookcases,                 
##       Chairs,                    
##       Phones,                    
##       Storage}           AB-10015
## [6]  {Accessories,               
##       Appliances,                
##       Binders,                   
##       Chairs,                    
##       Furnishings,               
##       Paper,                     
##       Supplies,                  
##       Tables}            AB-10060
## [7]  {Accessories,               
##       Art,                       
##       Binders,                   
##       Furnishings,               
##       Labels,                    
##       Machines,                  
##       Phones,                    
##       Storage,                   
##       Tables}            AB-10105
## [8]  {Accessories,               
##       Art,                       
##       Binders,                   
##       Furnishings,               
##       Paper,                     
##       Supplies}          AB-10150
## [9]  {Accessories,               
##       Art,                       
##       Binders,                   
##       Chairs,                    
##       Furnishings,               
##       Paper,                     
##       Storage}           AB-10165
## [10] {Accessories,               
##       Art,                       
##       Binders,                   
##       Copiers,                   
##       Fasteners,                 
##       Furnishings,               
##       Paper,                     
##       Phones,                    
##       Storage,                   
##       Supplies}          AB-10255

Before making analyzes, let’s first look for our sub-category frequency distribution.

itemFrequencyPlot(trans, topN = 17, type="absolute", main="Item Frequency")

itemFrequencyPlot(trans, topN = 17, type="relative", main="Item Frequency")

As you can see, the most purchased sub-categories are binders and papers.

Apriori Algorithm

rules <- apriori(trans, parameter = list(support = 0.006, confidence = 0.25, minlen = 2))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.25    0.1    1 none FALSE            TRUE       5   0.006      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 4 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[17 item(s), 793 transaction(s)] done [0.00s].
## sorting and recoding items ... [17 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10

## Warning in apriori(trans, parameter = list(support = 0.006, confidence = 0.25, :
## Mining stopped (maxlen reached). Only patterns up to a length of 10 returned!

##  done [0.02s].
## writing ... [176470 rule(s)] done [0.04s].
## creating S4 object  ... done [0.11s].

inspect(rules[1:10])

##      lhs          rhs           support    confidence coverage   lift     
## [1]  {Copiers} => {Bookcases}   0.02143758 0.265625   0.08070618 1.0802083
## [2]  {Copiers} => {Envelopes}   0.02774275 0.343750   0.08070618 1.3232706
## [3]  {Copiers} => {Tables}      0.02269861 0.281250   0.08070618 0.8545259
## [4]  {Copiers} => {Labels}      0.02648172 0.328125   0.08070618 0.9259898
## [5]  {Copiers} => {Appliances}  0.03404792 0.421875   0.08070618 0.9397384
## [6]  {Copiers} => {Chairs}      0.04665826 0.578125   0.08070618 1.1264205
## [7]  {Copiers} => {Accessories} 0.04539723 0.562500   0.08070618 0.9410601
## [8]  {Copiers} => {Art}         0.05800757 0.718750   0.08070618 1.1537829
## [9]  {Copiers} => {Phones}      0.06052963 0.750000   0.08070618 1.1638943
## [10] {Copiers} => {Storage}     0.05548550 0.687500   0.08070618 1.0606761
##      count
## [1]  17   
## [2]  22   
## [3]  18   
## [4]  21   
## [5]  27   
## [6]  37   
## [7]  36   
## [8]  46   
## [9]  48   
## [10] 44

Explanation of result:

lhs - The first product(s) in purchasing path or IF product also known as antecedent.
rhs - The following product(s) in purchasing path or THEN product also know as consequent.
Support - “how frequent an itemsetis in all the transactions”. (A. Garg 2018)
Confidence - “the likeliness of occurrence of consequent on the cart given that the cart already has the antecedents.” (A. Garg 2018)
Lift - “the support (frequency) of consequent while calculating the conditional probability of occurrence of {Y} given {X}.” (A. Garg 2018)
Coverage - “the support of the left-hand-side of the rule X => Y.” (rdrr.io)

In the below tables you can see the top 5 rules sorted for each element of result.

inspect(sort(rules, by = "lift")[1:5])

##     lhs              rhs            support confidence    coverage     lift count
## [1] {Chairs,                                                                     
##      Envelopes,                                                                  
##      Fasteners,                                                                  
##      Paper,                                                                      
##      Phones,                                                                     
##      Supplies}    => {Machines} 0.007566204  0.8571429 0.008827238 6.865801     6
## [2] {Appliances,                                                                 
##      Chairs,                                                                     
##      Envelopes,                                                                  
##      Fasteners,                                                                  
##      Paper,                                                                      
##      Phones,                                                                     
##      Supplies}    => {Machines} 0.007566204  0.8571429 0.008827238 6.865801     6
## [3] {Binders,                                                                    
##      Chairs,                                                                     
##      Envelopes,                                                                  
##      Fasteners,                                                                  
##      Paper,                                                                      
##      Phones,                                                                     
##      Supplies}    => {Machines} 0.007566204  0.8571429 0.008827238 6.865801     6
## [4] {Appliances,                                                                 
##      Binders,                                                                    
##      Chairs,                                                                     
##      Envelopes,                                                                  
##      Fasteners,                                                                  
##      Paper,                                                                      
##      Phones,                                                                     
##      Supplies}    => {Machines} 0.007566204  0.8571429 0.008827238 6.865801     6
## [5] {Appliances,                                                                 
##      Chairs,                                                                     
##      Envelopes,                                                                  
##      Fasteners,                                                                  
##      Phones,                                                                     
##      Storage,                                                                    
##      Supplies}    => {Machines} 0.006305170  0.8333333 0.007566204 6.675084     5

inspect(sort(rules, by = "confidence")[1:5])

##     lhs                            rhs       support     confidence coverage   
## [1] {Copiers, Machines}         => {Binders} 0.015132409 1          0.015132409
## [2] {Copiers, Supplies}         => {Binders} 0.017654477 1          0.017654477
## [3] {Copiers, Labels, Machines} => {Phones}  0.007566204 1          0.007566204
## [4] {Copiers, Labels, Machines} => {Paper}   0.007566204 1          0.007566204
## [5] {Copiers, Labels, Machines} => {Binders} 0.007566204 1          0.007566204
##     lift     count
## [1] 1.220000 12   
## [2] 1.220000 14   
## [3] 1.551859  6   
## [4] 1.297872  6   
## [5] 1.220000  6

inspect(sort(rules, by = "support")[1:5])

##     lhs              rhs           support   confidence coverage  lift    
## [1] {Paper}       => {Binders}     0.6506936 0.8445172  0.7704918 1.030311
## [2] {Binders}     => {Paper}       0.6506936 0.7938462  0.8196721 1.030311
## [3] {Furnishings} => {Binders}     0.5535939 0.8314394  0.6658260 1.014356
## [4] {Binders}     => {Furnishings} 0.5535939 0.6753846  0.8196721 1.014356
## [5] {Storage}     => {Binders}     0.5460277 0.8424125  0.6481715 1.027743
##     count
## [1] 516  
## [2] 516  
## [3] 439  
## [4] 439  
## [5] 433

inspect(sort(rules, by = "count")[1:5])

##     lhs              rhs           support   confidence coverage  lift    
## [1] {Paper}       => {Binders}     0.6506936 0.8445172  0.7704918 1.030311
## [2] {Binders}     => {Paper}       0.6506936 0.7938462  0.8196721 1.030311
## [3] {Furnishings} => {Binders}     0.5535939 0.8314394  0.6658260 1.014356
## [4] {Binders}     => {Furnishings} 0.5535939 0.6753846  0.8196721 1.014356
## [5] {Storage}     => {Binders}     0.5460277 0.8424125  0.6481715 1.027743
##     count
## [1] 516  
## [2] 516  
## [3] 439  
## [4] 439  
## [5] 433

The better way to see results is to use graphs.

plot(sort(rules, by = "confidence")[1:100], method="graph")

plot(rules, shading="order", control=list(main="Two-key plot"))

## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

Association Rules

Nijat Gurbanov

3/1/2022

Introduction

Data Preparation

Apriori Algorithm