Homework 13: Imagine 10000 receipts sitting on your table. Each receipt represents a transaction with items that were purchased. The receipt is a representation of stuff that went into a customer’s basket, hence the name ‘Market Basket Analysis’. That is exactly what the Groceries Data Set contains: a collection of receipts, with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached. Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift, and your top 10 rules by lift.

The first step is to read in the data and do a preliminary analysis of it.

library(arules)  # provides read.transactions(), apriori() and inspect()
data <- read.transactions("GroceryDataSet.csv", format="basket", sep=",")
summary(data)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55 
##   16   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   46   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

Using the read.transactions function from the arules package to load the CSV and summary() to inspect it, we can see the number of transactions and items, the density of the item matrix, the most frequent items, and the distribution of basket sizes. I’m curious to know what ‘baby cosmetics’ are.
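
As a quick sanity check on the summary, the density can be multiplied back out to recover the total number of purchased items and the mean basket size. This is only a sketch using figures copied from the output above; the variable names are my own.

```r
# Figures copied from the summary() output above
n_transactions <- 9835
n_items <- 169
density <- 0.02609146

# Density is the share of filled cells in the transaction-by-item matrix,
# so multiplying it back out recovers the total number of purchased items.
total_items <- n_transactions * n_items * density
mean_basket_size <- total_items / n_transactions

# The "most frequent items" counts (including Other) should add up to the same total
item_counts <- c(2513, 1903, 1809, 1715, 1372, 34055)

round(total_items)           # 43367
sum(item_counts)             # 43367
round(mean_basket_size, 3)   # 4.409, the Mean in the length distribution
```

The three numbers agree with each other, which suggests the file was parsed as intended.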

# Create an item frequency plot for the top 20 items
if (!require("RColorBrewer")) {
  # Install and load the RColorBrewer colour package if it is missing
  install.packages("RColorBrewer")
  library(RColorBrewer)
}
## Loading required package: RColorBrewer
itemFrequencyPlot(data,topN=20,type="absolute",col=brewer.pal(8,'Pastel2'), main="Absolute Item Frequency Plot")

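I used itemFrequencyPlot to draw a bar plot of the 20 most frequent items by absolute count. As the summary above shows, whole milk is the most frequently purchased item, followed by other vegetables, rolls/buns, soda and yogurt.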

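The next step is to mine the rules using the Apriori algorithm. The apriori() function is from the arules package. I set a minimum support of 0.001, a minimum confidence of 0.8 and a minimum rule length of 2 items.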

association.rules <- apriori(data, parameter = list(supp=0.001, conf=0.8,minlen=2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.001      2
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.02s].
## writing ... [410 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
summary(association.rules)
## set of 410 rules
## 
## rule length distribution (lhs + rhs):sizes
##   3   4   5   6 
##  29 229 140  12 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   4.000   4.000   4.329   5.000   6.000 
## 
## summary of quality measures:
##     support           confidence          lift            count      
##  Min.   :0.001017   Min.   :0.8000   Min.   : 3.131   Min.   :10.00  
##  1st Qu.:0.001017   1st Qu.:0.8333   1st Qu.: 3.312   1st Qu.:10.00  
##  Median :0.001220   Median :0.8462   Median : 3.588   Median :12.00  
##  Mean   :0.001247   Mean   :0.8663   Mean   : 3.951   Mean   :12.27  
##  3rd Qu.:0.001322   3rd Qu.:0.9091   3rd Qu.: 4.341   3rd Qu.:13.00  
##  Max.   :0.003152   Max.   :1.0000   Max.   :11.235   Max.   :31.00  
## 
## mining info:
##  data ntransactions support confidence
##  data          9835   0.001        0.8
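
The summary reports lift values between 3.131 and 11.235, and the assignment asks for the top 10 rules by lift. One way to produce that ranking is sketched below. It assumes the Groceries data bundled with arules can stand in for GroceryDataSet.csv so the snippet runs on its own; the names rules and top10_lift are mine.

```r
library(arules)

# Assumption: the Groceries data shipped with arules stands in for
# GroceryDataSet.csv so that the snippet runs without the attached file.
data("Groceries")

rules <- apriori(Groceries,
                 parameter = list(supp = 0.001, conf = 0.8, minlen = 2),
                 control = list(verbose = FALSE))

# sort() orders the rule set by a quality measure (decreasing by default);
# head() then keeps the ten rules with the highest lift.
top10_lift <- head(sort(rules, by = "lift"), n = 10)
inspect(top10_lift)
```

With the homework file, the same two lines can be applied to association.rules instead of rules.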

The algorithm generated 410 rules. I will take a look at the first 10 of them. Note that inspect() prints rules in the order it receives them, so the listing below is in mined order and is not sorted by confidence.

inspect(association.rules[1:10])
##      lhs                         rhs                    support confidence      lift count
## [1]  {liquor,                                                                             
##       red/blush wine}         => {bottled beer}     0.001931876  0.9047619 11.235269    19
## [2]  {cereals,                                                                            
##       curd}                   => {whole milk}       0.001016777  0.9090909  3.557863    10
## [3]  {cereals,                                                                            
##       yogurt}                 => {whole milk}       0.001728521  0.8095238  3.168192    17
## [4]  {butter,                                                                             
##       jam}                    => {whole milk}       0.001016777  0.8333333  3.261374    10
## [5]  {bottled beer,                                                                       
##       soups}                  => {whole milk}       0.001118454  0.9166667  3.587512    11
## [6]  {house keeping products,                                                             
##       napkins}                => {whole milk}       0.001321810  0.8125000  3.179840    13
## [7]  {house keeping products,                                                             
##       whipped/sour cream}     => {whole milk}       0.001220132  0.9230769  3.612599    12
## [8]  {pastry,                                                                             
##       sweet spreads}          => {whole milk}       0.001016777  0.9090909  3.557863    10
## [9]  {curd,                                                                               
##       turkey}                 => {other vegetables} 0.001220132  0.8000000  4.134524    12
## [10] {rice,                                                                               
##       sugar}                  => {whole milk}       0.001220132  1.0000000  3.913649    12
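
To show how the three reported measures relate, here is a small worked example for rule [10], {rice, sugar} => {whole milk}, using counts from the outputs above. The count for the left-hand side is not printed anywhere; I inferred it from the confidence of 1.0, so treat it as an assumption.

```r
# Counts taken from the outputs above; count_lhs is inferred, not reported
n_transactions <- 9835
count_rhs  <- 2513   # baskets with whole milk (summary of the data)
count_rule <- 12     # baskets with rice, sugar and whole milk (rule [10])
count_lhs  <- 12     # inferred: a confidence of 1.0 means count_rule == count_lhs

support    <- count_rule / n_transactions              # share of all baskets
confidence <- count_rule / count_lhs                   # share of lhs baskets
lift       <- confidence / (count_rhs / n_transactions) # confidence vs. base rate

round(c(support = support, confidence = confidence, lift = lift), 6)
```

The results match the support (0.001220132), confidence (1.0) and lift (3.913649) columns for that rule. A lift of about 3.9 means whole milk shows up in these baskets roughly 3.9 times as often as it does overall.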

To make sure that the rule set has no redundant rules, I will remove every rule in association.rules whose items contain all the items of another rule.

subset.rules <- which(colSums(is.subset(association.rules, association.rules)) > 1) 
length(subset.rules)
## [1] 91
new_rules <- association.rules[-subset.rules] # remove rules that are supersets of another rule
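
To make the colSums(is.subset(...)) > 1 step easier to follow, here is a toy version in base R. It is only an illustration of the idea, not the arules implementation, and the three made-up rules are hypothetical.

```r
# Hypothetical rules, each written as the set of items it contains (lhs + rhs)
rule_items <- list(
  r1 = c("rice", "whole milk"),
  r2 = c("rice", "sugar", "whole milk"),
  r3 = c("butter", "jam", "whole milk")
)

# subset_matrix[i, j] is TRUE when every item of rule i also appears in rule j
subset_matrix <- sapply(rule_items, function(rule_j) {
  sapply(rule_items, function(rule_i) all(rule_i %in% rule_j))
})

# Every rule is a subset of itself, so a column sum above 1 means that
# rule j contains at least one other rule and gets flagged for removal.
flagged <- which(colSums(subset_matrix) > 1)
kept <- rule_items[-flagged]

names(flagged)  # "r2"
names(kept)     # "r1" "r3"
```

Here r2 is dropped because it contains all the items of r1. arules also offers is.redundant(), which uses a different, confidence-based definition of redundancy and may be worth comparing.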

Now, I will inspect the filtered rule set (319 of the 410 rules remain) and see if there is any change in the first 10 rules.

inspect(new_rules[1:10])
##      lhs                         rhs                    support confidence      lift count
## [1]  {liquor,                                                                             
##       red/blush wine}         => {bottled beer}     0.001931876  0.9047619 11.235269    19
## [2]  {cereals,                                                                            
##       curd}                   => {whole milk}       0.001016777  0.9090909  3.557863    10
## [3]  {cereals,                                                                            
##       yogurt}                 => {whole milk}       0.001728521  0.8095238  3.168192    17
## [4]  {butter,                                                                             
##       jam}                    => {whole milk}       0.001016777  0.8333333  3.261374    10
## [5]  {bottled beer,                                                                       
##       soups}                  => {whole milk}       0.001118454  0.9166667  3.587512    11
## [6]  {house keeping products,                                                             
##       napkins}                => {whole milk}       0.001321810  0.8125000  3.179840    13
## [7]  {house keeping products,                                                             
##       whipped/sour cream}     => {whole milk}       0.001220132  0.9230769  3.612599    12
## [8]  {pastry,                                                                             
##       sweet spreads}          => {whole milk}       0.001016777  0.9090909  3.557863    10
## [9]  {curd,                                                                               
##       turkey}                 => {other vegetables} 0.001220132  0.8000000  4.134524    12
## [10] {rice,                                                                               
##       sugar}                  => {whole milk}       0.001220132  1.0000000  3.913649    12

Looks like there is no change in the first 10 rules. Several of the rules have whole milk on the right-hand side. I would like to target whole milk as both the lhs and the rhs to see if there are any similarities.

milk.association.rules <- apriori(data, parameter = list(supp=0.001, conf=0.8),appearance = list(default="lhs",rhs="whole milk"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [252 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
summary(milk.association.rules)
## set of 252 rules
## 
## rule length distribution (lhs + rhs):sizes
##   3   4   5   6 
##  18 146  81   7 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   4.000   4.000   4.306   5.000   6.000 
## 
## summary of quality measures:
##     support           confidence          lift           count      
##  Min.   :0.001017   Min.   :0.8000   Min.   :3.131   Min.   :10.00  
##  1st Qu.:0.001017   1st Qu.:0.8333   1st Qu.:3.261   1st Qu.:10.00  
##  Median :0.001220   Median :0.8481   Median :3.319   Median :12.00  
##  Mean   :0.001256   Mean   :0.8689   Mean   :3.401   Mean   :12.36  
##  3rd Qu.:0.001322   3rd Qu.:0.9091   3rd Qu.:3.558   3rd Qu.:13.00  
##  Max.   :0.002847   Max.   :1.0000   Max.   :3.914   Max.   :28.00  
## 
## mining info:
##  data ntransactions support confidence
##  data          9835   0.001        0.8
inspect(milk.association.rules[1:10])
##      lhs                         rhs              support confidence     lift count
## [1]  {cereals,                                                                     
##       curd}                   => {whole milk} 0.001016777  0.9090909 3.557863    10
## [2]  {cereals,                                                                     
##       yogurt}                 => {whole milk} 0.001728521  0.8095238 3.168192    17
## [3]  {butter,                                                                      
##       jam}                    => {whole milk} 0.001016777  0.8333333 3.261374    10
## [4]  {bottled beer,                                                                
##       soups}                  => {whole milk} 0.001118454  0.9166667 3.587512    11
## [5]  {house keeping products,                                                      
##       napkins}                => {whole milk} 0.001321810  0.8125000 3.179840    13
## [6]  {house keeping products,                                                      
##       whipped/sour cream}     => {whole milk} 0.001220132  0.9230769 3.612599    12
## [7]  {pastry,                                                                      
##       sweet spreads}          => {whole milk} 0.001016777  0.9090909 3.557863    10
## [8]  {rice,                                                                        
##       sugar}                  => {whole milk} 0.001220132  1.0000000 3.913649    12
## [9]  {butter,                                                                      
##       rice}                   => {whole milk} 0.001525165  0.8333333 3.261374    15
## [10] {domestic eggs,                                                               
##       rice}                   => {whole milk} 0.001118454  0.8461538 3.311549    11
milk.association.rules2 <- apriori(data, parameter = list(supp=0.001, conf=0.8),appearance = list(lhs="whole milk",default="rhs"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
summary(milk.association.rules2)
## set of 0 rules

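It appears that, at a minimum confidence of 0.8, there are only rules when ‘whole milk’ is on the right-hand side. No rule with ‘whole milk’ alone on the left-hand side reaches that threshold.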
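
Rules with whole milk on the left-hand side may still exist at a lower confidence. The sketch below reruns the search with a much looser threshold. The value conf = 0.1 is an arbitrary choice of mine, minlen = 2 is added to skip rules with an empty left-hand side, and the Groceries data bundled with arules is assumed to stand in for GroceryDataSet.csv.

```r
library(arules)

# Assumption: the bundled Groceries data stands in for GroceryDataSet.csv
data("Groceries")

# conf = 0.1 is an arbitrary, much looser threshold chosen for illustration;
# minlen = 2 excludes rules with an empty left-hand side.
milk_lhs_rules <- apriori(Groceries,
                          parameter = list(supp = 0.001, conf = 0.1, minlen = 2),
                          appearance = list(lhs = "whole milk", default = "rhs"),
                          control = list(verbose = FALSE))

# Show the five most confident rules of the form {whole milk} => {item}
inspect(head(sort(milk_lhs_rules, by = "confidence"), n = 5))
```

Any rules found this way have confidence below 0.8, so they are weaker statements than the rules mined above and should be read with that in mind.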

Now, let’s graph some of these rules. I will subset the filtered rules to only look at the top 10 by confidence. The plot() methods for rules come from the arulesViz package.

library(arulesViz)
top10 <- head(new_rules, n = 10, by = "confidence")
plot(top10, method = "graph",  engine = "htmlwidget")
plot(top10, method="paracoord")

Now, I will do the same with the milk association rules.

top10milk <- head(milk.association.rules, n = 10, by = "confidence")
plot(top10milk, method = "graph",  engine = "htmlwidget")
plot(top10milk, method="paracoord")

The graph plots draw the rules as a network of vertices and edges, with vertices labeled by item name. This gives a visual overview of which items are linked by the rules.

Reference: https://www.datacamp.com/community/tutorials/market-basket-analysis-r