Unsupervised Learning - Association Rules

Daria Ivanushenko

Review of the Data set

The dataset is taken from the dataset repository. It contains information on transactions and the items purchased in each of them. The dataset contains 1862 transactions and 12 unique items.

library(arules)
transactions = read.transactions("transaction_data.csv", sep = ",", header = TRUE, skip = 0, format = "basket")
transactions

## transactions in sparse format with
##  1862 transactions (rows) and
##  12 items (columns)

inspect(head(transactions))

##     items              
## [1] {abrasive cleaner, 
##      coffee,           
##      fish,             
##      frozen meals,     
##      ice cream}        
## [2] {baking powder,    
##      butter,           
##      coffee,           
##      frozen meals,     
##      ice cream}        
## [3] {abrasive cleaner, 
##      butter,           
##      coffee,           
##      frozen vegetables,
##      ice cream}        
## [4] {baking powder,    
##      butter,           
##      frozen meals,     
##      ice cream}        
## [5] {cake bar,         
##      coffee,           
##      frozen meals,     
##      frozen vegetables,
##      ice cream}        
## [6] {abrasive cleaner, 
##      domestic eggs,    
##      fish,             
##      honey}

length(transactions)

## [1] 1862

itemFrequency(transactions, type="absolute")

##  abrasive cleaner     baking powder            butter          cake bar 
##               360               662               839               386 
##            coffee     domestic eggs              fish      frozen meals 
##               605               430               563              1001 
## frozen vegetables            grapes             honey         ice cream 
##               541               445               355               554

itemFrequencyPlot(transactions, topN=12, type="relative", main="Item Frequency")

From the plot above we can see that frozen meals, butter and baking powder are the top 3 purchased products.
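
The same ranking can be read off the absolute counts. Below is a minimal sketch in base R, with the counts copied by hand from the itemFrequency() output above; the variable names are illustrative and not part of the original analysis.

```r
# Absolute item counts copied from the itemFrequency() output above
item_counts <- c(
  "abrasive cleaner" = 360, "baking powder" = 662, "butter" = 839,
  "cake bar" = 386, "coffee" = 605, "domestic eggs" = 430,
  "fish" = 563, "frozen meals" = 1001, "frozen vegetables" = 541,
  "grapes" = 445, "honey" = 355, "ice cream" = 554
)
n_transactions <- 1862

# Relative frequency = share of transactions that contain the item
item_share <- item_counts / n_transactions

# Three most frequently purchased items
top3 <- head(sort(item_share, decreasing = TRUE), 3)
round(top3, 3)
```

With the transactions object loaded, head(sort(itemFrequency(transactions), decreasing = TRUE), 3) should give the same ranking.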

Association Rules

We create rules from our dataset with the Apriori algorithm.

rules.trans = apriori(transactions, parameter=list(supp=0.1, conf=0.5))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 186 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[12 item(s), 1862 transaction(s)] done [0.00s].
## sorting and recoding items ... [12 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [31 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

rules.trans

## set of 31 rules

Here we can see that 31 rules were created.

rules.trans2 = apriori(transactions, parameter=list(supp=0.05, conf=0.5))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5    0.05      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 93 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[12 item(s), 1862 transaction(s)] done [0.00s].
## sorting and recoding items ... [12 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [70 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

rules.trans2

## set of 70 rules

We can infer that the lower the minimum support, the more rules we have to analyze. I decided to stick with the 31 rules.
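
The "Absolute minimum support count" lines in the two logs follow from the relative threshold and the number of transactions. The sketch below is a rough check in base R. Dropping the fractional part is an assumption that happens to reproduce the printed values; it is not a statement about how arules computes them internally.

```r
n_transactions <- 1862
min_support <- c(0.1, 0.05)   # the two thresholds tried above

# Relative threshold times number of transactions, fractional part dropped
min_count <- trunc(min_support * n_transactions)
min_count
```

Halving the threshold roughly halves the required count (186 versus 93), so more itemsets qualify as frequent and the number of rules grows from 31 to 70.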

rules.by.conf = sort(rules.trans, by="confidence", decreasing=TRUE)
inspect(head(rules.by.conf))

##     lhs                              rhs            support   confidence
## [1] {coffee}                      => {frozen meals} 0.3114930 0.9586777 
## [2] {baking powder,coffee}        => {frozen meals} 0.1331901 0.9575290 
## [3] {baking powder,butter,coffee} => {frozen meals} 0.1015038 0.9450000 
## [4] {coffee,fish}                 => {frozen meals} 0.1143931 0.9424779 
## [5] {butter,coffee}               => {frozen meals} 0.1552095 0.9322581 
## [6] {baking powder,coffee}        => {butter}       0.1074114 0.7722008 
##     coverage  lift     count
## [1] 0.3249194 1.783275 580  
## [2] 0.1390977 1.781138 248  
## [3] 0.1074114 1.757832 189  
## [4] 0.1213749 1.753141 213  
## [5] 0.1664876 1.734130 289  
## [6] 0.1390977 1.713752 200

Confidence tells us how often item rhs appears in a customer's basket given that item lhs is already there. The highest value confidence can take is 1, which means that the customer always purchases item rhs together with item lhs. The five rules with the highest confidence all have coffee on the left-hand side and frozen meals on the right-hand side, so a customer whose basket contains coffee (alone or together with baking powder, butter or fish) will most probably also buy frozen meals.
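
As a worked example, the confidence of {coffee} => {frozen meals} can be recomputed by hand. This is a minimal sketch in base R, using counts copied from the outputs in this report (605 transactions with coffee, 1001 with frozen meals, 580 with both); the variable names are illustrative.

```r
# Counts copied from the outputs in this report
count_coffee <- 605          # transactions containing coffee
count_frozen <- 1001         # transactions containing frozen meals
count_coffee_frozen <- 580   # transactions containing both

# confidence(lhs => rhs) = count(lhs and rhs) / count(lhs)
conf_coffee_to_frozen <- count_coffee_frozen / count_coffee
conf_frozen_to_coffee <- count_coffee_frozen / count_frozen

round(c(conf_coffee_to_frozen, conf_frozen_to_coffee), 7)
```

The two directions give about 0.959 and 0.579. Confidence is not symmetric: almost every coffee basket contains frozen meals, but only a little over half of the frozen meals baskets contain coffee.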

rules.by.lift = sort(rules.trans, by="lift", decreasing=TRUE) 
inspect(head(rules.by.lift))

##     lhs                                    rhs             support   confidence
## [1] {fish,frozen meals}                 => {coffee}        0.1143931 0.6339286 
## [2] {baking powder,butter,frozen meals} => {coffee}        0.1015038 0.6096774 
## [3] {baking powder,frozen meals}        => {coffee}        0.1331901 0.6063570 
## [4] {butter,coffee,frozen meals}        => {baking powder} 0.1015038 0.6539792 
## [5] {butter,coffee}                     => {baking powder} 0.1074114 0.6451613 
## [6] {coffee}                            => {frozen meals}  0.3114930 0.9586777 
##     coverage  lift     count
## [1] 0.1804511 1.951033 213  
## [2] 0.1664876 1.876396 189  
## [3] 0.2196563 1.866176 248  
## [4] 0.1552095 1.839440 189  
## [5] 0.1664876 1.814638 200  
## [6] 0.3249194 1.783275 580

Lift compares how often lhs and rhs are purchased together with how often we would expect that to happen if they were bought independently of each other. From the output we can say that baskets containing fish and frozen meals are almost twice as likely (lift is equal to 1.951) to also contain coffee as an average basket. A lift higher than 1 implies that the products are bought together more often than expected under independence, while a lift below 1 means that they are bought together less often than expected.
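
The lift of the top rule can be checked against the printed values. The sketch below uses base R and the standard definition, lift = confidence divided by the support of the right-hand side. The numbers are copied from the outputs in this report and the variable names are illustrative.

```r
n_transactions <- 1862
count_coffee <- 605          # transactions containing coffee (the rhs)

# Confidence of {fish, frozen meals} => {coffee}, copied from the output above
rule_confidence <- 0.6339286

# lift(lhs => rhs) = confidence(lhs => rhs) / support(rhs)
support_rhs <- count_coffee / n_transactions
rule_lift <- rule_confidence / support_rhs
round(rule_lift, 3)
```

Coffee appears in about 32% of all baskets but in about 63% of the baskets that already contain fish and frozen meals, which is where the factor of roughly 1.95 comes from.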

rules.by.count = sort(rules.trans, by="count", decreasing=TRUE) 
inspect(head(rules.by.count))

##     lhs                rhs            support   confidence coverage  lift    
## [1] {}              => {frozen meals} 0.5375940 0.5375940  1.0000000 1.000000
## [2] {coffee}        => {frozen meals} 0.3114930 0.9586777  0.3249194 1.783275
## [3] {frozen meals}  => {coffee}       0.3114930 0.5794206  0.5375940 1.783275
## [4] {butter}        => {frozen meals} 0.2696026 0.5983313  0.4505908 1.112980
## [5] {frozen meals}  => {butter}       0.2696026 0.5014985  0.5375940 1.112980
## [6] {baking powder} => {butter}       0.2658432 0.7477341  0.3555317 1.659453
##     count
## [1] 1001 
## [2]  580 
## [3]  580 
## [4]  502 
## [5]  502 
## [6]  495

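Count shows the number of transactions that contain all items of a rule. The most frequently purchased are frozen meals (1001 transactions) and the combination of frozen meals and coffee (580 transactions). Since support is the count divided by the number of transactions, this ordering is the same as the ordering by support.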

rules.by.supp = sort(rules.trans, by="support", decreasing=TRUE) 
inspect(head(rules.by.supp))

##     lhs                rhs            support   confidence coverage  lift    
## [1] {}              => {frozen meals} 0.5375940 0.5375940  1.0000000 1.000000
## [2] {coffee}        => {frozen meals} 0.3114930 0.9586777  0.3249194 1.783275
## [3] {frozen meals}  => {coffee}       0.3114930 0.5794206  0.5375940 1.783275
## [4] {butter}        => {frozen meals} 0.2696026 0.5983313  0.4505908 1.112980
## [5] {frozen meals}  => {butter}       0.2696026 0.5014985  0.5375940 1.112980
## [6] {baking powder} => {butter}       0.2658432 0.7477341  0.3555317 1.659453
##     count
## [1] 1001 
## [2]  580 
## [3]  580 
## [4]  502 
## [5]  502 
## [6]  495

Support tells us what share of the transactions in our dataset contain a certain set of items. The highest support is around 54% for frozen meals, meaning that frozen meals appear in 1001 of the 1862 transactions (not that they were bought on their own). We can also notice that frozen meals appear in five of the six rules with the highest support.
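
The support and coverage columns can be reproduced from the counts. A minimal sketch in base R, assuming the usual definitions (support is the count of the whole rule divided by the number of transactions, coverage is the support of the left-hand side); the counts are copied from the outputs in this report.

```r
n_transactions <- 1862

# Counts copied from the outputs in this report
count_frozen <- 1001         # {frozen meals}
count_coffee <- 605          # {coffee}
count_coffee_frozen <- 580   # {coffee, frozen meals}

# support(itemset) = count(itemset) / number of transactions
support_frozen <- count_frozen / n_transactions
support_rule <- count_coffee_frozen / n_transactions

# coverage(lhs => rhs) = support(lhs), here for {coffee} => {frozen meals}
coverage_rule <- count_coffee / n_transactions

round(c(support_frozen, support_rule, coverage_rule), 7)
```

The three values correspond to the support of {} => {frozen meals} and to the support and coverage of {coffee} => {frozen meals} in the table above.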

Graphical Illustration

With the help of the arulesViz package I will produce graphical illustrations of the rules. I decided to create a smaller sample consisting of the top 5 rules by lift and to analyze it graphically. Cutting the sample down makes the graphs easier to understand.

rules_5 <- head(rules.trans, n = 5, by = "lift")

library(arulesViz)
plot(rules_5, method="graph")

This graph illustrates the five rules with the highest lift, taken from the rules we produced with the Apriori algorithm at a minimum support of 10%. The graph shows the relationships between the items in our dataset. The size of a circle represents support and its color shows the lift.

plot(rules.trans, method = "paracoord")

A parallel coordinates plot is another graphical representation of the association rules. Color intensity shows confidence and the width of the arrow shows support. Source.

Another great way of plotting association rules was found at DataCamp. It's an interactive version of the previous plots and I think it's great for plotting large numbers of rules. You can use the dropdown list to filter by association rule, item name and item id. You can also zoom in on the plot, which helps to better understand the relationships.

plot(rules.trans, method = "graph", engine = "htmlwidget")