Introduction

Association Rule Mining is a machine learning method that analyzes which events occur together and establishes relationships between them. Many machine learning algorithms work with numerical data and tend to be very mathematical (Logistic Regression, Support Vector Machines). Association Rule Mining algorithms, however, work very well with categorical data.

The usual example of Association Rule Mining is market basket analysis. It uncovers customers’ purchasing habits by finding associations between the products in their purchases. With this information, retailers can decide on shelf order, increase their sales rates and develop effective sales strategies.

Support: the ratio of the number of transactions containing an item A to the total number of transactions. (count(A) / number of all transactions)
Confidence: the ratio of the number of transactions containing both items A and B to the number of transactions containing A. (count(A and B) / count(A))

Support and confidence values are examined in order to establish an association rule. A rule is kept only if it reaches both a minimum support and a minimum confidence.
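To make the two measures concrete, here is a minimal base R sketch on four made-up baskets. The baskets, the helper `contains()` and the two thresholds are illustrative assumptions, not part of the groceries data used later.

```r
# Hypothetical toy baskets (not from the groceries dataset)
baskets <- list(
  c("milk", "bread"),
  c("milk", "yogurt"),
  c("milk", "bread", "yogurt"),
  c("bread")
)

# TRUE for every basket that contains all of the given items
contains <- function(items) {
  vapply(baskets, function(b) all(items %in% b), logical(1))
}

support_A  <- mean(contains("milk"))              # share of baskets with milk
support_AB <- mean(contains(c("milk", "bread")))  # share with milk and bread
confidence <- support_AB / support_A              # confidence of milk => bread

# Assumed thresholds, chosen only for illustration
min_support    <- 0.4
min_confidence <- 0.6
rule_is_kept <- support_AB >= min_support && confidence >= min_confidence
```

Here milk appears in 3 of 4 baskets and milk with bread in 2 of 4, so the rule milk => bread has support 0.5 and confidence 2/3, and it passes both assumed thresholds.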

Other measures for an association rule are Lift, Conviction, All-Confidence, Collective Strength and Leverage.
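For reference, three of these measures are commonly defined in terms of support and confidence as follows. The notation (supp, conf, and A ∪ B for "contains both A and B") is an assumption made here for illustration and does not come from the original text.

```latex
\begin{aligned}
\mathrm{lift}(A \Rightarrow B) &= \frac{\mathrm{conf}(A \Rightarrow B)}{\mathrm{supp}(B)}
  = \frac{\mathrm{supp}(A \cup B)}{\mathrm{supp}(A)\,\mathrm{supp}(B)} \\
\mathrm{leverage}(A \Rightarrow B) &= \mathrm{supp}(A \cup B) - \mathrm{supp}(A)\,\mathrm{supp}(B) \\
\mathrm{conviction}(A \Rightarrow B) &= \frac{1 - \mathrm{supp}(B)}{1 - \mathrm{conf}(A \Rightarrow B)}
\end{aligned}
```

A lift of 1 (or a leverage of 0) means A and B occur together exactly as often as expected if they were independent.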

There are two Association Rule Mining algorithms that are widely used.

We will focus on the Apriori algorithm in this paper.

Apriori Algorithm

Apriori is an algorithm developed to extract relationships between items in a dataset. It uses a bottom-up approach: it starts from single items and looks for relationships between each item and the others.

For example, suppose the figure above shows the shopping baskets of customers in a market. The first table lists the products bought in each basket (1 3 4, 2 3 4, etc.). The algorithm first finds the frequency of each product, i.e. the total number of times it was bought (product 1 was bought 2 times, product 3 was bought 3 times, etc.). It then derives the minimum support threshold, here 50% of the highest frequency (3 * 50/100 = 1.5), and eliminates the products whose frequency falls below it. The remaining products are combined into larger itemsets, the same process is repeated, and the table shrinks further. This continues until no further frequent combinations can be formed.
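The level-wise procedure can be sketched in base R. This is a simplified illustration, not the implementation used by the arules package. The four toy baskets are hypothetical, and the threshold follows the common convention of a fraction of the number of baskets (50% of 4 = 2), which differs from the calculation quoted above.

```r
# Hypothetical toy baskets; items are identified by integers
baskets <- list(c(1, 3, 4), c(2, 3, 5), c(1, 2, 3, 5), c(2, 5))
min_count <- 2  # assumed threshold: 50% of the 4 baskets

# Number of baskets that contain every item of `itemset`
count_itemset <- function(itemset, baskets) {
  sum(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

# Level 1: the candidates are the single items
level <- as.list(sort(unique(unlist(baskets))))
frequent <- list()

while (length(level) > 0) {
  counts <- vapply(level, count_itemset, integer(1), baskets = baskets)
  keep <- level[counts >= min_count]   # drop infrequent candidates
  frequent <- c(frequent, keep)
  if (length(keep) < 2) break

  # Build candidates one item larger from the surviving items
  k <- length(keep[[1]])
  pool <- sort(unique(unlist(keep)))
  if (length(pool) < k + 1) break
  candidates <- combn(pool, k + 1, simplify = FALSE)

  # Apriori pruning: every k-item subset of a candidate must be frequent
  keep_keys <- vapply(keep, paste, character(1), collapse = ",")
  level <- Filter(function(cand) {
    subsets <- combn(cand, k, simplify = FALSE)
    all(vapply(subsets, paste, character(1), collapse = ",") %in% keep_keys)
  }, candidates)
}
```

On these baskets the loop keeps four single items, four pairs and the triple {2, 3, 5}, so `frequent` ends with nine itemsets. Item 4 is dropped at the first level because it occurs only once.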

Methodology

About the Data

The dataset contains 9835 transactions by customers shopping for groceries and 169 unique items. The following preprocessing was done by the data owner. The csv file was read transaction by transaction and each transaction was saved as a list. A mapping was created from the unique items in the dataset to integers, so that each item corresponds to a unique integer. The entire dataset was mapped to integers to reduce storage and computational requirements. A reverse mapping from the integers back to the items was also created, so that the item names could be written to the final output file.
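The data owner's code is not shown, so the following is only a minimal base R sketch of such a forward and reverse mapping. The toy transactions and the names `item_to_id`, `encoded` and `decoded` are illustrative assumptions.

```r
# Hypothetical transactions standing in for rows of the csv file
transactions <- list(c("milk", "bread"), c("yogurt", "milk"))

# Forward mapping: each unique item gets one integer id
items <- sort(unique(unlist(transactions)))
item_to_id <- setNames(seq_along(items), items)

# Encode every transaction as a vector of integer ids
encoded <- lapply(transactions, function(t) unname(item_to_id[t]))

# Reverse mapping: an id indexes back into the item vector
decoded <- lapply(encoded, function(ids) items[ids])
```

Decoding the encoded transactions gives back the original item names, which is what allows readable names in the final output.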

Read the data

Let’s import the relevant libraries and read our data.

library(arulesViz)
library(arules)
setwd("C:\\Users\\ozgrp\\Desktop\\UW\\USL\\Project03")
trans1<-read.transactions("groceries.csv", format="basket", sep=",", skip=0)

If we inspect our data with the “inspect()” function, we see output like the following (only the last two baskets are pasted here as text, since the full output is very long):

[9834] {bottled beer,
bottled water,
semi-finished bread,
soda}
[9835] {chicken,
other vegetables,
shopping bags,
tropical fruit,
vinegar}

We can check the size of each basket (first 20 elements are displayed)

head(size(trans1),n=20)
##  [1] 4 3 1 4 4 5 1 5 1 2 5 9 1 3 2 4 1 1 1 1

Although the final rows of the inspect() output already show it, let’s confirm the number of transactions in our market basket data.

length(trans1)
## [1] 9835

We can display head of our basket in list form:

LIST(head(trans1))
## [[1]]
## [1] "citrus fruit"        "margarine"           "ready soups"        
## [4] "semi-finished bread"
## 
## [[2]]
## [1] "coffee"         "tropical fruit" "yogurt"        
## 
## [[3]]
## [1] "whole milk"
## 
## [[4]]
## [1] "cream cheese" "meat spreads" "pip fruit"    "yogurt"      
## 
## [[5]]
## [1] "condensed milk"           "long life bakery product"
## [3] "other vegetables"         "whole milk"              
## 
## [[6]]
## [1] "abrasive cleaner" "butter"           "rice"            
## [4] "whole milk"       "yogurt"

As shown below, we can obtain the relative and absolute frequency of each product. For example, beef appears in about 5% of all baskets, 516 times in total.

head(itemFrequency(trans1, type="relative"),n=20)
## abrasive cleaner artif. sweetener   baby cosmetics        baby food 
##     0.0035587189     0.0032536858     0.0006100661     0.0001016777 
##             bags    baking powder bathroom cleaner             beef 
##     0.0004067107     0.0176919166     0.0027452974     0.0524656838 
##          berries        beverages     bottled beer    bottled water 
##     0.0332486019     0.0260294865     0.0805287239     0.1105236401 
##           brandy      brown bread           butter      butter milk 
##     0.0041687850     0.0648703610     0.0554143366     0.0279613625 
##         cake bar          candles            candy      canned beer 
##     0.0132180986     0.0089476360     0.0298932384     0.0776817489
head(itemFrequency(trans1, type="absolute"), n=20)
## abrasive cleaner artif. sweetener   baby cosmetics        baby food 
##               35               32                6                1 
##             bags    baking powder bathroom cleaner             beef 
##                4              174               27              516 
##          berries        beverages     bottled beer    bottled water 
##              327              256              792             1087 
##           brandy      brown bread           butter      butter milk 
##               41              638              545              275 
##         cake bar          candles            candy      canned beer 
##              130               88              294              764
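The two outputs are linked by the number of transactions: a relative frequency is the absolute count divided by 9835. A quick base R check with the beef values printed above (the variable names are illustrative):

```r
n_transactions <- 9835   # length(trans1), as reported earlier
beef_count <- 516        # absolute frequency of beef

# Relative frequency = absolute count / number of transactions
beef_share <- beef_count / n_transactions
round(beef_share, 4)
```

The result agrees with the relative frequency of 0.0524656838 reported for beef.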

Now let’s display the frequencies in a more visual way.

itemFrequencyPlot(trans1, topN=10, type="absolute", main="Item Frequency") 

itemFrequencyPlot(trans1, topN=10, type="relative", main="Item Frequency")

When we run Apriori rule extraction with a minimum support of 0.1 and a minimum confidence of 0.2, we get only one rule. Let’s display it.

rules.trans1<-apriori(trans1, parameter=list(supp=0.1, conf=0.2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.2    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 983 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [8 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [1 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
rules.by.conf<-sort(rules.trans1, by="confidence", decreasing=TRUE) 
inspect(rules.by.conf)
##     lhs    rhs          support  confidence lift count
## [1] {}  => {whole milk} 0.255516 0.255516   1    2513

Ideally we need more than one rule to support business decisions or alternative strategies. So, let’s extract the rules again with a minimum confidence of 0.1.

rules.trans1<-apriori(trans1, parameter=list(supp=0.1, conf=0.1))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.1    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 983 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [8 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [8 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

As seen above, we now have 8 rules. Let’s inspect them ordered by confidence. Other sort orders are available through the “by” argument of sort().

rules.by.conf<-sort(rules.trans1, by="confidence", decreasing=TRUE) 
inspect(rules.by.conf)
##     lhs    rhs                support   confidence lift count
## [1] {}  => {whole milk}       0.2555160 0.2555160  1    2513 
## [2] {}  => {other vegetables} 0.1934926 0.1934926  1    1903 
## [3] {}  => {rolls/buns}       0.1839349 0.1839349  1    1809 
## [4] {}  => {soda}             0.1743772 0.1743772  1    1715 
## [5] {}  => {yogurt}           0.1395018 0.1395018  1    1372 
## [6] {}  => {bottled water}    0.1105236 0.1105236  1    1087 
## [7] {}  => {root vegetables}  0.1089985 0.1089985  1    1072 
## [8] {}  => {tropical fruit}   0.1049314 0.1049314  1    1032
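All eight rules have an empty left-hand side, so they only restate how often each item occurs: confidence equals support and lift is exactly 1. The arithmetic for the first rule can be reproduced in base R from the counts in the output. This is a sketch with illustrative variable names, and the `floor()` line is only an observation that matches the reported minimum support count, not a statement of how arules computes it.

```r
n_transactions <- 9835
milk_count <- 2513                    # count column of {} => {whole milk}

support <- milk_count / n_transactions
confidence <- support / 1             # an empty lhs is contained in every basket
lift <- confidence / support          # always 1 for an empty lhs

# 10% of 9835 baskets, rounded down, matches "Absolute minimum support count: 983"
min_count <- floor(0.1 * n_transactions)
```

Because these rules carry no information about co-occurrence, rules with a non-empty left-hand side are the ones worth interpreting.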

Let’s take “yogurt” as an example and inspect which items lead people to buy yogurt.

rules.yogurt<-apriori(data=trans1, parameter=list(supp=0.001,conf = 0.08), 
                            appearance=list(default="lhs", rhs="yogurt"), control=list(verbose=F)) 
rules.yogurt.byconf<-sort(rules.yogurt, by="confidence", decreasing=TRUE)
inspect(head(rules.yogurt.byconf))
##     lhs                     rhs          support confidence     lift count
## [1] {butter,                                                              
##      cream cheese,                                                        
##      root vegetables}    => {yogurt} 0.001016777  0.9090909 6.516698    10
## [2] {butter,                                                              
##      sliced cheese,                                                       
##      tropical fruit,                                                      
##      whole milk}         => {yogurt} 0.001016777  0.9090909 6.516698    10
## [3] {cream cheese,                                                        
##      curd,                                                                
##      other vegetables,                                                    
##      whipped/sour cream} => {yogurt} 0.001016777  0.9090909 6.516698    10
## [4] {butter,                                                              
##      other vegetables,                                                    
##      tropical fruit,                                                      
##      white bread}        => {yogurt} 0.001016777  0.9090909 6.516698    10
## [5] {pip fruit,                                                           
##      sausage,                                                             
##      sliced cheese}      => {yogurt} 0.001220132  0.8571429 6.144315    12
## [6] {butter,                                                              
##      curd,                                                                
##      tropical fruit,                                                      
##      whole milk}         => {yogurt} 0.001220132  0.8571429 6.144315    12

As can be seen above, if a person bought butter, cream cheese and root vegetables, they are very likely to add yogurt to the basket as well.
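The three measures of the first rule can be checked by hand in base R. The left-hand-side count of 11 is not printed by arules; it is inferred here from count = 10 and confidence = 0.9090909 (10/11). The yogurt count of 1372 comes from the earlier rule listing.

```r
n_transactions <- 9835
count_rule <- 10       # baskets with butter, cream cheese, root vegetables and yogurt
count_lhs <- 11        # inferred: baskets with butter, cream cheese and root vegetables
count_yogurt <- 1372   # baskets with yogurt

support <- count_rule / n_transactions
confidence <- count_rule / count_lhs
lift <- confidence / (count_yogurt / n_transactions)
```

A lift of about 6.5 means that baskets with these three items contain yogurt roughly 6.5 times as often as baskets in general, although the rule rests on only 10 transactions.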

Now let’s inspect what that person is likely to add to the basket after adding yogurt.

rules.yogurt<-apriori(data=trans1, parameter=list(supp=0.001,conf = 0.08), 
                          appearance=list(default="rhs",lhs="yogurt"), control=list(verbose=F)) 
rules.yogurt.byconf<-sort(rules.yogurt, by="support", decreasing=FALSE)
inspect(head(rules.yogurt.byconf))
##     lhs         rhs                 support    confidence lift     count
## [1] {yogurt} => {frankfurter}       0.01118454 0.08017493 1.359518 110  
## [2] {yogurt} => {beef}              0.01169293 0.08381924 1.597601 115  
## [3] {yogurt} => {napkins}           0.01230300 0.08819242 1.684218 121  
## [4] {yogurt} => {cream cheese}      0.01240468 0.08892128 2.242412 122  
## [5] {yogurt} => {frozen vegetables} 0.01240468 0.08892128 1.848924 122  
## [6] {yogurt} => {margarine}         0.01423488 0.10204082 1.742312 140

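Note that the rules are sorted by ascending support here, so the table shows the six rules with the lowest support. People who bought yogurt also bought frankfurter, beef, napkins, cream cheese or frozen vegetables with only about 8-9% confidence, and margarine with about 10% confidence. All lifts are above 1, so yogurt buyers pick these items somewhat more often than the average customer, but with such low confidence it is hard to conclude that people will buy any of them after they buy yogurt.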

Let’s visualize our rules as a graph, as in the example in our introduction.

plot(rules.trans1, method="graph", control=list(type="items"))
## Available control parameters (with default values):
## main  =  Graph for 8 rules
## nodeColors    =  c("#66CC6680", "#9999CC80")
## nodeCol   =  c("#EE0000FF", "#EE0303FF", "#EE0606FF", ..., "#EEECECFF", "#EEEEEEFF")
## edgeCol   =  c("#474747FF", "#494949FF", "#4B4B4BFF", ..., "#E2E2E2FF", "#E2E2E2FF")
## alpha     =  0.5
## cex   =  1
## itemLabels    =  TRUE
## labelCol  =  #000000B3
## measureLabels     =  FALSE
## precision     =  3
## layout    =  NULL
## layoutParams  =  list()
## arrowSize     =  0.5
## engine    =  igraph
## plot  =  TRUE
## plot_options  =  list()
## max   =  100
## verbose   =  FALSE

Optional Calculations For Further Use

We can also find the closed frequent itemsets for further use (these are the itemsets that have no superset with the same support).

trans1.closed<-apriori(trans1, parameter=list(target="closed frequent itemsets",support=0.15))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##          NA    0.1    1 none FALSE            TRUE       5    0.15      1
##  maxlen                   target   ext
##      10 closed frequent itemsets FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 1475 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [4 item(s)] done [0.02s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## filtering closed item sets ... done [0.00s].
## writing ... [4 set(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
inspect(trans1.closed)
##     items              support   count
## [1] {soda}             0.1743772 1715 
## [2] {rolls/buns}       0.1839349 1809 
## [3] {other vegetables} 0.1934926 1903 
## [4] {whole milk}       0.2555160 2513
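The definition of a closed itemset can be illustrated with a small base R sketch. The three itemsets and their counts are hypothetical and chosen only to show one closed and one non-closed case; this is not how arules filters closed itemsets internally.

```r
# Hypothetical frequent itemsets with their support counts
itemsets <- list(c("a"), c("b"), c("a", "b"))
counts <- c(4, 3, 3)

# An itemset is closed if no proper superset has the same count
is_closed <- vapply(seq_along(itemsets), function(i) {
  has_equal_superset <- any(vapply(seq_along(itemsets), function(j) {
    length(itemsets[[j]]) > length(itemsets[[i]]) &&
      all(itemsets[[i]] %in% itemsets[[j]]) &&
      counts[j] == counts[i]
  }, logical(1)))
  !has_equal_superset
}, logical(1))
```

Here {b} is not closed because its superset {a, b} has the same count of 3, while {a} and {a, b} are closed. In the output above all four itemsets are single items, since no larger itemset reaches the 0.15 support threshold.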

We can check the significance of our rules in a list as well.

is.significant(rules.yogurt, trans1)
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
## [23]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
## [34] FALSE  TRUE  TRUE  TRUE

We can also find out which transactions in the dataset support our rules with the following code chunk. I will not run it here since it produces very long output; part of the output is pasted below.

supportingTransactions(rules.yogurt, trans1)
inspect(supportingTransactions(rules.yogurt, trans1))

15 {300,302,564,622,744,862,865,917,937,1017,1148,1183,1185,1217,1229,1240,1263,1392,1418,1558,1592,1668,1879,1922,2016,
2445,2460,2502,2510,2527,2710,2948,2974,3006,3085,3149,3157,3611,3726,3895,3940,4072,4091,4105,4313,4336,4343,4375,4400,
4431,4450,4681,4826,4960,4994,5107,5316,5402,5559,5721,5786,5822,5910,6190,6211,6224,6282,6312,6512,6565,6634,6726,6852,
6937,7132,7226,7249,7287,7342,7473,7493,7520,7554,7740,7812,7826,7902,7908,8231,8302,8306,8348,8423,8616,8633,8640,8679,
8743,8750,8823,8836,8848,8883,8909,9063,9088,9140,9170,9236,9319,9355,9531,9617,9669,9778}