Problem statement:

Imagine 10000 receipts sitting on your table. Each receipt represents a transaction with items that were purchased. The receipt is a representation of stuff that went into a customer’s basket, hence the name ‘Market Basket Analysis’.

That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached.

Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.

Let's inspect the data set we have in GroceryDataSet.csv.

dataset = read.csv('C:\\Users\\charls.joseph\\Documents\\Cuny\\Data624\\week13\\GroceryDataSet.csv', header = FALSE)
nrow(dataset)
## [1] 9835
head(dataset)
##                 V1                  V2             V3                       V4
## 1     citrus fruit semi-finished bread      margarine              ready soups
## 2   tropical fruit              yogurt         coffee                         
## 3       whole milk                                                            
## 4        pip fruit              yogurt  cream cheese              meat spreads
## 5 other vegetables          whole milk condensed milk long life bakery product
## 6       whole milk              butter         yogurt                     rice
##                 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
## 1                                                                             
## 2                                                                             
## 3                                                                             
## 4                                                                             
## 5                                                                             
## 6 abrasive cleaner                                                            
##   V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32
## 1                                            
## 2                                            
## 3                                            
## 4                                            
## 5                                            
## 6

There are 9835 observations.

Next we convert the data into a sparse matrix, which is the input format the apriori algorithm expects. Each column of the sparse matrix is a product and each row is a transaction; a cell indicates whether that product was purchased in that transaction.

library(arules)  # provides read.transactions, itemFrequencyPlot, apriori and inspect
dataset = read.transactions('C:\\Users\\charls.joseph\\Documents\\Cuny\\Data624\\week13\\GroceryDataSet.csv', sep = ',', rm.duplicates = TRUE)
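
As a rough illustration of the sparse-matrix idea, the sketch below builds a small transaction-by-item matrix in base R. The `toy` baskets and the `incidence` name are made up for this example and are not taken from GroceryDataSet.csv; arules keeps the same information in a more compact sparse form.

```r
# Toy transactions (hypothetical, not taken from GroceryDataSet.csv)
toy <- list(
  c("whole milk", "butter", "yogurt"),
  c("whole milk", "yogurt"),
  c("citrus fruit", "whole milk"),
  c("butter", "citrus fruit")
)

# Every distinct product becomes one column
items <- sort(unique(unlist(toy)))

# One row per transaction, one column per item; TRUE means "purchased"
incidence <- t(vapply(toy, function(basket) items %in% basket,
                      logical(length(items))))
colnames(incidence) <- items

print(incidence)
```

Most cells are FALSE even in this tiny example, which is why a sparse representation pays off once there are thousands of receipts and well over a hundred products.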

Let's look at the 10 most frequently purchased products.

itemFrequencyPlot(dataset, topN = 10)

Let's run apriori to extract rules by defining the minimum support and confidence values.

What is support and how can the minimum support value be selected?

The support of a product p1 is the number of transactions that contain p1 divided by the total number of transactions. It is essentially the likelihood that the product is purchased. We don't want rules built on products that are rarely purchased, so we need to set a minimum support value before running the apriori algorithm.
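
A minimal sketch of that definition in base R, using a made-up list of four baskets. The `toy` data and the `support` helper are hypothetical names for illustration only; arules computes the same quantity internally.

```r
# Toy transactions (hypothetical, not taken from GroceryDataSet.csv)
toy <- list(
  c("whole milk", "butter", "yogurt"),
  c("whole milk", "yogurt"),
  c("citrus fruit", "whole milk"),
  c("butter", "citrus fruit")
)

# Fraction of transactions that contain every item in `itemset`
support <- function(itemset, transactions) {
  mean(vapply(transactions,
              function(basket) all(itemset %in% basket),
              logical(1)))
}

support("whole milk", toy)               # 3 of 4 baskets: 0.75
support(c("whole milk", "yogurt"), toy)  # 2 of 4 baskets: 0.5
```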

We want products that are purchased at least 6 times a day. We will assume that the given data was captured over one week, so the minimum support for the week is calculated below. The apriori calls that follow use this value rounded to 0.004.

min_suport <- 6 * 7/ nrow(dataset)
min_suport
## [1] 0.004270463

What is confidence and how can the minimum confidence value be selected?

Here, we are extracting rules between different products. Confidence is the likelihood that a product p2 is purchased given that another product p1 is purchased.

Confidence(p1 -> p2) = (# of observations where p1 and p2 are purchased) / (# of observations where p1 is purchased)
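
The formula can be sketched in base R on a made-up list of four baskets. The `toy` data and the `support` and `confidence` helpers are hypothetical names used only for this illustration.

```r
# Toy transactions (hypothetical, not taken from GroceryDataSet.csv)
toy <- list(
  c("whole milk", "butter", "yogurt"),
  c("whole milk", "yogurt"),
  c("citrus fruit", "whole milk"),
  c("butter", "citrus fruit")
)

# Fraction of transactions that contain every item in `itemset`
support <- function(itemset, transactions) {
  mean(vapply(transactions,
              function(basket) all(itemset %in% basket),
              logical(1)))
}

# Confidence(lhs -> rhs) = support(lhs and rhs together) / support(lhs)
confidence <- function(lhs, rhs, transactions) {
  support(c(lhs, rhs), transactions) / support(lhs, transactions)
}

confidence("yogurt", "whole milk", toy)  # 1: both yogurt baskets contain whole milk
confidence("butter", "whole milk", toy)  # 0.5: one of the two butter baskets does
```

Dividing the two support values gives the same result as dividing the raw counts, because the total number of transactions cancels out.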

We will start with a minimum confidence of 0.6 and evaluate the rules. If the rules do not make sense, we will lower the value until we get sensible rules.

# Training Apriori on the dataset
rules = apriori(data = dataset, parameter = list(support = 0.004, confidence = 0.6))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5   0.004      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 39 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [126 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [40 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

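In the next step, we inspect the top 10 rules extracted, sorted by decreasing lift. Lift compares a rule's confidence with the overall support of its right-hand side; a lift above 1 means the items on the two sides occur together more often than they would if they were purchased independently.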

inspect(sort(rules, by = 'lift')[1:10])
##      lhs                     rhs                    support confidence    coverage     lift count
## [1]  {citrus fruit,                                                                              
##       root vegetables,                                                                           
##       tropical fruit}     => {other vegetables} 0.004473818  0.7857143 0.005693950 4.060694    44
## [2]  {citrus fruit,                                                                              
##       root vegetables,                                                                           
##       whole milk}         => {other vegetables} 0.005795628  0.6333333 0.009150991 3.273165    57
## [3]  {pip fruit,                                                                                 
##       root vegetables,                                                                           
##       whole milk}         => {other vegetables} 0.005490595  0.6136364 0.008947636 3.171368    54
## [4]  {root vegetables,                                                                           
##       tropical fruit,                                                                            
##       yogurt}             => {other vegetables} 0.004982206  0.6125000 0.008134215 3.165495    49
## [5]  {pip fruit,                                                                                 
##       whipped/sour cream} => {other vegetables} 0.005592272  0.6043956 0.009252669 3.123610    55
## [6]  {onions,                                                                                    
##       root vegetables}    => {other vegetables} 0.005693950  0.6021505 0.009456024 3.112008    56
## [7]  {curd,                                                                                      
##       domestic eggs}      => {whole milk}       0.004778851  0.7343750 0.006507372 2.874086    47
## [8]  {butter,                                                                                    
##       curd}               => {whole milk}       0.004880529  0.7164179 0.006812405 2.803808    48
## [9]  {tropical fruit,                                                                            
##       whipped/sour cream,                                                                        
##       yogurt}             => {whole milk}       0.004372140  0.7049180 0.006202339 2.758802    43
## [10] {root vegetables,                                                                           
##       tropical fruit,                                                                            
##       yogurt}             => {whole milk}       0.005693950  0.7000000 0.008134215 2.739554    56
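
As a quick sanity check on how lift relates to confidence, the sketch below uses only numbers copied from rules [1] and [2] in the output above. Both rules have {other vegetables} on the right-hand side, so dividing confidence by lift should give the same implied support for that item in both cases. The variable names are made up for this illustration.

```r
# Values copied from the inspect() output above
conf_rule1 <- 0.7857143; lift_rule1 <- 4.060694   # rule [1]
conf_rule2 <- 0.6333333; lift_rule2 <- 3.273165   # rule [2]

# lift = confidence / support(rhs), so support(rhs) = confidence / lift
rhs_support_1 <- conf_rule1 / lift_rule1
rhs_support_2 <- conf_rule2 / lift_rule2

rhs_support_1   # about 0.1935
rhs_support_2   # about 0.1935, the same item so the same support

# Equal up to the rounding of the printed values
all.equal(rhs_support_1, rhs_support_2, tolerance = 1e-5)
```

Read the other way round, a lift of about 4 for rule [1] says that customers who bought the three left-hand items chose other vegetables roughly four times as often as customers overall.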

Looking at the above results, the rules are not very informative: every one of the top 10 predicts either other vegetables or whole milk. Let's reduce the minimum confidence value from 0.6 to 0.4 and see which rules are extracted.

# Training Apriori on the dataset
rules = apriori(data = dataset, parameter = list(support = 0.004, confidence = 0.4))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.4    0.1    1 none FALSE            TRUE       5   0.004      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 39 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [126 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [432 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
# Inspecting the top 10 rules by lift
inspect(sort(rules, by = 'lift')[1:10])
##      lhs                     rhs                    support confidence    coverage     lift count
## [1]  {liquor}             => {bottled beer}     0.004677173  0.4220183 0.011082867 5.240594    46
## [2]  {herbs,                                                                                     
##       whole milk}         => {root vegetables}  0.004168785  0.5394737 0.007727504 4.949369    41
## [3]  {citrus fruit,                                                                              
##       other vegetables,                                                                          
##       tropical fruit}     => {root vegetables}  0.004473818  0.4943820 0.009049314 4.535678    44
## [4]  {citrus fruit,                                                                              
##       other vegetables,                                                                          
##       root vegetables}    => {tropical fruit}   0.004473818  0.4313725 0.010371124 4.110997    44
## [5]  {citrus fruit,                                                                              
##       other vegetables,                                                                          
##       whole milk}         => {root vegetables}  0.005795628  0.4453125 0.013014743 4.085493    57
## [6]  {citrus fruit,                                                                              
##       root vegetables,                                                                           
##       tropical fruit}     => {other vegetables} 0.004473818  0.7857143 0.005693950 4.060694    44
## [7]  {herbs}              => {root vegetables}  0.007015760  0.4312500 0.016268429 3.956477    69
## [8]  {tropical fruit,                                                                            
##       whipped/sour cream,                                                                        
##       whole milk}         => {yogurt}           0.004372140  0.5512821 0.007930859 3.951792    43
## [9]  {citrus fruit,                                                                              
##       pip fruit}          => {tropical fruit}   0.005592272  0.4044118 0.013828165 3.854060    55
## [10] {whipped/sour cream,                                                                        
##       whole milk,                                                                                
##       yogurt}             => {tropical fruit}   0.004372140  0.4018692 0.010879512 3.829829    43
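
Rules [3], [4] and [6] in this output share the same support and count (44), which suggests they are different arrangements of one itemset: {citrus fruit, other vegetables, root vegetables, tropical fruit}. The sketch below copies their printed values into a small data frame (the `rules_tbl` name is made up for this illustration) and checks that confidence is simply support divided by coverage, so only the left-hand side changes between them.

```r
# Values copied from rules [3], [4] and [6] in the output above
rules_tbl <- data.frame(
  rule       = c("[3]", "[4]", "[6]"),
  rhs        = c("root vegetables", "tropical fruit", "other vegetables"),
  support    = c(0.004473818, 0.004473818, 0.004473818),
  coverage   = c(0.009049314, 0.010371124, 0.005693950),
  confidence = c(0.4943820, 0.4313725, 0.7857143)
)

# Coverage is the support of the left-hand side alone,
# so confidence = support / coverage
rules_tbl$recomputed <- rules_tbl$support / rules_tbl$coverage

print(rules_tbl)
```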

The rules have improved, but some of them still do not make much sense. For example, citrus fruit appears in several rules in combination with root vegetables. Let's reduce the minimum confidence value further, to 0.2 this time.

# Training Apriori on the dataset
rules = apriori(data = dataset, parameter = list(support = 0.004, confidence = 0.2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.2    0.1    1 none FALSE            TRUE       5   0.004      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 39 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [126 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 done [0.01s].
## writing ... [1268 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
# Inspecting the top 10 rules by lift
inspect(sort(rules, by = 'lift')[1:10])
##      lhs                     rhs                      support confidence    coverage     lift count
## [1]  {flour}              => {sugar}              0.004982206  0.2865497 0.017386884 8.463112    49
## [2]  {processed cheese}   => {white bread}        0.004168785  0.2515337 0.016573462 5.975445    41
## [3]  {liquor}             => {bottled beer}       0.004677173  0.4220183 0.011082867 5.240594    46
## [4]  {berries,                                                                                     
##       whole milk}         => {whipped/sour cream} 0.004270463  0.3620690 0.011794611 5.050990    42
## [5]  {herbs,                                                                                       
##       whole milk}         => {root vegetables}    0.004168785  0.5394737 0.007727504 4.949369    41
## [6]  {citrus fruit,                                                                                
##       other vegetables,                                                                            
##       tropical fruit}     => {root vegetables}    0.004473818  0.4943820 0.009049314 4.535678    44
## [7]  {other vegetables,                                                                            
##       root vegetables,                                                                             
##       tropical fruit}     => {citrus fruit}       0.004473818  0.3636364 0.012302999 4.393567    44
## [8]  {whipped/sour cream,                                                                          
##       yogurt}             => {curd}               0.004575496  0.2205882 0.020742247 4.140239    45
## [9]  {citrus fruit,                                                                                
##       other vegetables,                                                                            
##       root vegetables}    => {tropical fruit}     0.004473818  0.4313725 0.010371124 4.110997    44
## [10] {citrus fruit,                                                                                
##       other vegetables,                                                                            
##       whole milk}         => {root vegetables}    0.005795628  0.4453125 0.013014743 4.085493    57

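Finally, we get some sensible association rules, such as {flour} => {sugar} with a lift of 8.46 and {liquor} => {bottled beer} with a lift of 5.24.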
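
To report support, confidence and lift for the top 10 rules in one compact table, something along the following lines may help. This is only a sketch: it assumes the arules package is installed and, so that it runs without the CSV file, it uses the `Groceries` data set bundled with arules as a stand-in for GroceryDataSet.csv. The `report` name is made up for this example.

```r
# Assumes the arules package is installed
suppressPackageStartupMessages(library(arules))

# Stand-in data: the Groceries transactions shipped with arules
# (an assumption; swap in the object returned by read.transactions)
data("Groceries")

rules <- apriori(
  Groceries,
  parameter = list(support = 0.004, confidence = 0.2),
  control   = list(verbose = FALSE)   # suppress the progress log
)

# Top 10 rules by decreasing lift
top10 <- sort(rules, by = "lift")[1:10]

# One row per rule with the three measures the assignment asks for
report <- data.frame(
  rule = labels(top10),
  quality(top10)[, c("support", "confidence", "lift")]
)

print(report)
```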