Association Rule Mining is a machine learning method that analyzes which events occur together and establishes relationships among them. Many machine learning algorithms work with numerical data and tend to be heavily mathematical (Logistic Regression, Support Vector Machines). Association Rule Mining algorithms, by contrast, work very well with categorical data.
The usual example of Association Rule Mining is market basket analysis. This process uncovers customers' purchasing habits by finding associations between the products in their purchases. With this information, retailers can decide on shelf placement, increase their sales, and develop effective sales strategies.
Support: the ratio of the number of transactions containing an item (or itemset) A to the total number of transactions. (transactions containing A / number of all transactions)
Confidence: the ratio of the number of transactions containing both A and B to the number of transactions containing A. (transactions containing A and B / transactions containing A)
Support and confidence values are examined in order to establish an association rule. A rule must meet both a minimum support and a minimum confidence.
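To make the two definitions concrete, here is a minimal base R sketch. The baskets and the helper names `support()` and `confidence()` are made up for illustration; they are not part of the Groceries data or of the arules package.

```r
# Hypothetical toy baskets (not from the Groceries data)
baskets <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("butter", "milk"),
  c("bread", "butter")
)

# Fraction of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# confidence(A => B) = support(A and B together) / support(A)
confidence <- function(lhs, rhs, baskets) {
  support(c(lhs, rhs), baskets) / support(lhs, baskets)
}

supp_bread      <- support("bread", baskets)             # 3 of 4 baskets
supp_bread_milk <- support(c("bread", "milk"), baskets)  # 2 of 4 baskets
conf_bread_milk <- confidence("bread", "milk", baskets)  # 0.5 / 0.75
print(c(supp_bread, supp_bread_milk, conf_bread_milk))
```

Here bread appears in 3 of 4 baskets, bread and milk together in 2 of 4, so the rule {bread} => {milk} has a confidence of 2/3.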
Other measures for the association rule are Lift, Conviction, All-Confidence, Collective Strength and Leverage.
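Three of these measures are easy to compute by hand. The sketch below uses their commonly cited definitions (lift = supp(A and B) / (supp(A) * supp(B)), leverage = supp(A and B) - supp(A) * supp(B), conviction = (1 - supp(B)) / (1 - conf(A => B))) on made-up baskets; the data and the helper `support()` are assumptions for illustration only.

```r
# Hypothetical toy baskets (not from the Groceries data)
baskets <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("butter", "milk"),
  c("bread", "butter")
)

# Fraction of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Rule under study: {bread} => {milk}
s_a  <- support("bread", baskets)             # 0.75
s_b  <- support("milk", baskets)              # 0.75
s_ab <- support(c("bread", "milk"), baskets)  # 0.50
conf <- s_ab / s_a

lift       <- s_ab / (s_a * s_b)      # below 1: bought together less often than expected
leverage   <- s_ab - s_a * s_b        # negative for the same reason
conviction <- (1 - s_b) / (1 - conf)
print(c(lift = lift, leverage = leverage, conviction = conviction))
```

A lift of 1 (or a leverage of 0) means the two sides of the rule look independent; values above that suggest a positive association.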
There are two Association Rule Mining algorithms that are widely used.
We will focus on the Apriori algorithm in this paper.
Apriori is an algorithm developed to extract relationships between data items in machine learning. It uses a bottom-up approach: it examines one item at a time and looks for relationships between that item and the others.
For example, suppose the figure above shows the shopping baskets of customers in a market. The first table lists the products bought in each basket (1 3 4, 2 3 4, etc.). The algorithm first finds the frequency of each product, i.e. the total number of times it was bought (product 1 was bought 2 times, product 3 was bought 3 times, etc.). It then derives the minimum support threshold from the highest frequency (at 50%: 3 * 50/100 = 1.5), and products whose frequency is below this value are eliminated. The remaining products are combined into larger itemsets, the same process is repeated, and the table shrinks further. This continues until a relationship is found.
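The level-by-level idea can be sketched in a few lines of base R. This is a simplified illustration, not the implementation used by the arules package: the baskets are hypothetical, the threshold `min_count` is an assumed value, and candidates are built from all combinations of the surviving items rather than with the more efficient join-and-prune step of the real algorithm.

```r
# Hypothetical baskets of item ids, for illustration only
baskets <- list(c(1, 3, 4), c(2, 3, 5), c(1, 2, 3, 5), c(2, 5))
min_count <- 2  # assumed threshold: an itemset must occur in at least 2 baskets

# Number of baskets containing every item of `itemset`
count_itemset <- function(itemset, baskets) {
  sum(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

# Level 1: frequent single items
items <- sort(unique(unlist(baskets)))
frequent <- Filter(function(s) count_itemset(s, baskets) >= min_count,
                   as.list(items))
all_frequent <- list()

# Level k + 1: combine surviving items, keep candidates that are still frequent
while (length(frequent) > 0) {
  all_frequent <- c(all_frequent, frequent)
  k <- length(frequent[[1]])
  pool <- sort(unique(unlist(frequent)))
  if (length(pool) < k + 1) break
  candidates <- combn(pool, k + 1, simplify = FALSE)
  frequent <- Filter(function(s) count_itemset(s, baskets) >= min_count,
                     candidates)
}

frequent_labels <- vapply(all_frequent, paste, character(1), collapse = " ")
print(frequent_labels)
```

In this toy data item 4 is dropped at the first level because it occurs only once, and the largest surviving itemset is {2, 3, 5}.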
The dataset contains 9835 transactions by customers shopping for groceries and 169 unique items. The following preprocessing was done by the data owner. The csv file was read transaction by transaction, and each transaction was saved as a list. A mapping was created from the unique items in the dataset to integers, so that each item corresponds to a unique integer. The entire dataset was mapped to integers to reduce storage and computational requirements. A reverse mapping from the integers back to the items was also created, so that the item names could be written to the final output file.
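The forward and reverse mappings described above might look roughly like this in base R. The three transactions and all variable names are hypothetical; the data owner's actual preprocessing code is not shown in this article.

```r
# Hypothetical transactions standing in for rows of the csv file
transactions <- list(
  c("whole milk", "yogurt"),
  c("soda", "whole milk"),
  c("yogurt")
)

# Forward mapping: each unique item name gets a unique integer id
item_names <- sort(unique(unlist(transactions)))
item_to_id <- setNames(seq_along(item_names), item_names)

# Encode every transaction as a vector of integer ids
encoded <- lapply(transactions, function(tx) unname(item_to_id[tx]))

# Reverse mapping: the id is the position of the name in item_names
decoded <- lapply(encoded, function(ids) item_names[ids])

print(encoded)
print(identical(decoded, transactions))
```

Storing integers instead of strings keeps the transactions compact, and the reverse lookup restores the readable item names for the final output.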
Let’s import the relevant libraries and read our data.
library(arulesViz)
library(arules)
setwd("C:\\Users\\ozgrp\\Desktop\\UW\\USL\\Project03")
trans1<-read.transactions("groceries.csv", format="basket", sep=",", skip=0)
If we inspect our data with the “inspect()” function, we will see output like the following (pasted here as text, since the full output is very long):
[9834] {bottled beer,
bottled water,
semi-finished bread,
soda}
[9835] {chicken,
other vegetables,
shopping bags,
tropical fruit,
vinegar}
We can check the size of each basket (first 20 elements are displayed)
head(size(trans1),n=20)
## [1] 4 3 1 4 4 5 1 5 1 2 5 9 1 3 2 4 1 1 1 1
[925] 2 2 2 6 1 5 15 8 2 1 2 2 8 5 2 2 3 3 4 1 1 4 3 7 19 8 2 3 3 2 1 2 6 3 8 5 3 10 3 1 1 3
[967] 8 5 4 4 9 5 1 5 9 4 3 1 3 17 20 3 7 6 4 2 4 9 6 16 16 1 3 11 2 3 23 1 5 10
Although we can already see it in the final rows of the inspect() output, let’s confirm the number of baskets in our data.
length(trans1)
## [1] 9835
We can display head of our basket in list form:
LIST(head(trans1))
## [[1]]
## [1] "citrus fruit" "margarine" "ready soups"
## [4] "semi-finished bread"
##
## [[2]]
## [1] "coffee" "tropical fruit" "yogurt"
##
## [[3]]
## [1] "whole milk"
##
## [[4]]
## [1] "cream cheese" "meat spreads" "pip fruit" "yogurt"
##
## [[5]]
## [1] "condensed milk" "long life bakery product"
## [3] "other vegetables" "whole milk"
##
## [[6]]
## [1] "abrasive cleaner" "butter" "rice"
## [4] "whole milk" "yogurt"
As seen below, both the relative and the absolute frequency of each product can be obtained. For example, beef appears in about 5% of all baskets, which corresponds to 516 occurrences.
head(itemFrequency(trans1, type="relative"),n=20)
## abrasive cleaner artif. sweetener baby cosmetics baby food
## 0.0035587189 0.0032536858 0.0006100661 0.0001016777
## bags baking powder bathroom cleaner beef
## 0.0004067107 0.0176919166 0.0027452974 0.0524656838
## berries beverages bottled beer bottled water
## 0.0332486019 0.0260294865 0.0805287239 0.1105236401
## brandy brown bread butter butter milk
## 0.0041687850 0.0648703610 0.0554143366 0.0279613625
## cake bar candles candy canned beer
## 0.0132180986 0.0089476360 0.0298932384 0.0776817489
head(itemFrequency(trans1, type="absolute"), n=20)
## abrasive cleaner artif. sweetener baby cosmetics baby food
## 35 32 6 1
## bags baking powder bathroom cleaner beef
## 4 174 27 516
## berries beverages bottled beer bottled water
## 327 256 792 1087
## brandy brown bread butter butter milk
## 41 638 545 275
## cake bar candles candy canned beer
## 130 88 294 764
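The two outputs are linked by the number of baskets: the relative frequency is the absolute frequency divided by the total number of transactions. A quick check for beef, using only numbers reported above (the variable names are mine):

```r
n_transactions <- 9835  # number of baskets, from length(trans1)
beef_count     <- 516   # absolute frequency of beef

# Relative frequency = absolute frequency / number of transactions
beef_relative <- beef_count / n_transactions
print(round(beef_relative, 10))
```

The result agrees with the relative frequency of beef shown in the first table.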
Now let’s display the frequencies in a more visual way.
itemFrequencyPlot(trans1, topN=10, type="absolute", main="Item Frequency")
itemFrequencyPlot(trans1, topN=10, type="relative", main="Item Frequency")
When we run Apriori rule extraction with a minimum support of 0.1 and a minimum confidence of 0.2, we get only one rule. Let’s display it.
rules.trans1<-apriori(trans1, parameter=list(supp=0.1, conf=0.2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.2 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 983
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [8 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [1 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules.by.conf<-sort(rules.trans1, by="confidence", decreasing=TRUE)
inspect(rules.by.conf)
## lhs rhs support confidence lift count
## [1] {} => {whole milk} 0.255516 0.255516 1 2513
Ideally, we need more than one rule to support business decisions or alternative strategies. So let’s extract the rules again with a minimum confidence of 0.1.
rules.trans1<-apriori(trans1, parameter=list(supp=0.1, conf=0.1))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.1 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 983
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [8 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [8 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
As seen above, we now have 8 rules. Let’s inspect them ordered by confidence. Other orderings are possible by changing the “by” parameter of the function.
rules.by.conf<-sort(rules.trans1, by="confidence", decreasing=TRUE)
inspect(rules.by.conf)
## lhs rhs support confidence lift count
## [1] {} => {whole milk} 0.2555160 0.2555160 1 2513
## [2] {} => {other vegetables} 0.1934926 0.1934926 1 1903
## [3] {} => {rolls/buns} 0.1839349 0.1839349 1 1809
## [4] {} => {soda} 0.1743772 0.1743772 1 1715
## [5] {} => {yogurt} 0.1395018 0.1395018 1 1372
## [6] {} => {bottled water} 0.1105236 0.1105236 1 1087
## [7] {} => {root vegetables} 0.1089985 0.1089985 1 1072
## [8] {} => {tropical fruit} 0.1049314 0.1049314 1 1032
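Note that all eight rules have an empty left-hand side, which is why support and confidence are identical and lift is exactly 1 in every row. The short check below reproduces the first row from its count; it is plain arithmetic on the reported numbers, with variable names chosen for illustration.

```r
n_transactions   <- 9835
whole_milk_count <- 2513  # count column of the {} => {whole milk} rule

support_rhs <- whole_milk_count / n_transactions

# With an empty left-hand side every transaction matches the antecedent,
# so support(lhs) is 1 and confidence equals the support of the right-hand side
confidence <- support_rhs / 1
lift <- confidence / support_rhs

print(c(support = support_rhs, confidence = confidence, lift = lift))
```

In other words, these rules only restate how often each item is bought and say nothing yet about items bought together.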
Let’s take “yogurt” as an example and inspect which items in a basket tend to lead people to buy yogurt as well.
rules.yogurt<-apriori(data=trans1, parameter=list(supp=0.001,conf = 0.08),
appearance=list(default="lhs", rhs="yogurt"), control=list(verbose=F))
rules.yogurt.byconf<-sort(rules.yogurt, by="confidence", decreasing=TRUE)
inspect(head(rules.yogurt.byconf))
## lhs rhs support confidence lift count
## [1] {butter,
## cream cheese,
## root vegetables} => {yogurt} 0.001016777 0.9090909 6.516698 10
## [2] {butter,
## sliced cheese,
## tropical fruit,
## whole milk} => {yogurt} 0.001016777 0.9090909 6.516698 10
## [3] {cream cheese,
## curd,
## other vegetables,
## whipped/sour cream} => {yogurt} 0.001016777 0.9090909 6.516698 10
## [4] {butter,
## other vegetables,
## tropical fruit,
## white bread} => {yogurt} 0.001016777 0.9090909 6.516698 10
## [5] {pip fruit,
## sausage,
## sliced cheese} => {yogurt} 0.001220132 0.8571429 6.144315 12
## [6] {butter,
## curd,
## tropical fruit,
## whole milk} => {yogurt} 0.001220132 0.8571429 6.144315 12
As can be seen above, if a person bought butter, cream cheese and root vegetables, they are very likely to add yogurt to their basket as well.
Now let’s inspect what that person is likely to add to the basket after adding yogurt.
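It helps to see how small the evidence behind such a rule is. The sketch below recomputes support and lift for the first rule from the counts reported above (10 baskets for the rule, 1372 baskets with yogurt, 9835 baskets in total); the variable names are mine.

```r
n_transactions <- 9835
yogurt_count   <- 1372       # count of the {} => {yogurt} rule
rule_count     <- 10         # baskets containing the left-hand side and yogurt
confidence     <- 0.9090909  # as reported by inspect()

# Number of baskets that contain the left-hand side at all
lhs_count <- rule_count / confidence

support <- rule_count / n_transactions
lift    <- confidence / (yogurt_count / n_transactions)

print(c(lhs_count = lhs_count, support = support, lift = lift))
```

So the 91% confidence comes from roughly 10 out of 11 baskets, which is worth keeping in mind before acting on the rule.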
rules.yogurt<-apriori(data=trans1, parameter=list(supp=0.001,conf = 0.08),
appearance=list(default="rhs",lhs="yogurt"), control=list(verbose=F))
rules.yogurt.byconf<-sort(rules.yogurt, by="support", decreasing=FALSE)
inspect(head(rules.yogurt.byconf))
## lhs rhs support confidence lift count
## [1] {yogurt} => {frankfurter} 0.01118454 0.08017493 1.359518 110
## [2] {yogurt} => {beef} 0.01169293 0.08381924 1.597601 115
## [3] {yogurt} => {napkins} 0.01230300 0.08819242 1.684218 121
## [4] {yogurt} => {cream cheese} 0.01240468 0.08892128 2.242412 122
## [5] {yogurt} => {frozen vegetables} 0.01240468 0.08892128 1.848924 122
## [6] {yogurt} => {margarine} 0.01423488 0.10204082 1.742312 140
The rules above are sorted by ascending support. People who buy yogurt also buy frankfurter, beef, napkins, cream cheese or frozen vegetables with a confidence of only about 8-9%, and margarine with about 10%. With confidence this low, it is hard to conclude that people will buy any of these items, margarine included, after they buy yogurt, even though every lift value is above 1.
Let’s visualize our rules, as in the example in our introduction.
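The confidence column is a fraction, not a percentage, which is easy to misread. Recomputing it from the count column and the 1372 baskets that contain yogurt gives values between roughly 8% and 10%; the snippet below is plain arithmetic on the reported counts.

```r
yogurt_count <- 1372  # baskets containing yogurt

# Count column of the six {yogurt} => {item} rules shown above
rule_counts <- c(frankfurter = 110, beef = 115, napkins = 121,
                 "cream cheese" = 122, "frozen vegetables" = 122,
                 margarine = 140)

# confidence = baskets with yogurt and the item / baskets with yogurt
confidence <- rule_counts / yogurt_count
print(round(100 * confidence, 1))  # in percent
```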
plot(rules.trans1, method="graph", control=list(type="items"))
## Available control parameters (with default values):
## main = Graph for 8 rules
## nodeColors = c("#66CC6680", "#9999CC80")
## nodeCol = c("#EE0000FF", "#EE0303FF", "#EE0606FF", "#EE0909FF", "#EE0C0CFF", "#EE0F0FFF", "#EE1212FF", "#EE1515FF", "#EE1818FF", "#EE1B1BFF", "#EE1E1EFF", "#EE2222FF", "#EE2525FF", "#EE2828FF", "#EE2B2BFF", "#EE2E2EFF", "#EE3131FF", "#EE3434FF", "#EE3737FF", "#EE3A3AFF", "#EE3D3DFF", "#EE4040FF", "#EE4444FF", "#EE4747FF", "#EE4A4AFF", "#EE4D4DFF", "#EE5050FF", "#EE5353FF", "#EE5656FF", "#EE5959FF", "#EE5C5CFF", "#EE5F5FFF", "#EE6262FF", "#EE6666FF", "#EE6969FF", "#EE6C6CFF", "#EE6F6FFF", "#EE7272FF", "#EE7575FF", "#EE7878FF", "#EE7B7BFF", "#EE7E7EFF", "#EE8181FF", "#EE8484FF", "#EE8888FF", "#EE8B8BFF", "#EE8E8EFF", "#EE9191FF", "#EE9494FF", "#EE9797FF", "#EE9999FF", "#EE9B9BFF", "#EE9D9DFF", "#EE9F9FFF", "#EEA0A0FF", "#EEA2A2FF", "#EEA4A4FF", "#EEA5A5FF", "#EEA7A7FF", "#EEA9A9FF", "#EEABABFF", "#EEACACFF", "#EEAEAEFF", "#EEB0B0FF", "#EEB1B1FF", "#EEB3B3FF", "#EEB5B5FF", "#EEB7B7FF", "#EEB8B8FF", "#EEBABAFF", "#EEBCBCFF", "#EEBDBDFF", "#EEBFBFFF", "#EEC1C1FF", "#EEC3C3FF", "#EEC4C4FF", "#EEC6C6FF", "#EEC8C8FF", "#EEC9C9FF", "#EECBCBFF", "#EECDCDFF", "#EECFCFFF", "#EED0D0FF", "#EED2D2FF", "#EED4D4FF", "#EED5D5FF", "#EED7D7FF", "#EED9D9FF", "#EEDBDBFF", "#EEDCDCFF", "#EEDEDEFF", "#EEE0E0FF", "#EEE1E1FF", "#EEE3E3FF", "#EEE5E5FF", "#EEE7E7FF", "#EEE8E8FF", "#EEEAEAFF", "#EEECECFF", "#EEEEEEFF")
## edgeCol = c("#474747FF", "#494949FF", "#4B4B4BFF", "#4D4D4DFF", "#4F4F4FFF", "#515151FF", "#535353FF", "#555555FF", "#575757FF", "#595959FF", "#5B5B5BFF", "#5E5E5EFF", "#606060FF", "#626262FF", "#646464FF", "#666666FF", "#686868FF", "#6A6A6AFF", "#6C6C6CFF", "#6E6E6EFF", "#707070FF", "#727272FF", "#747474FF", "#767676FF", "#787878FF", "#7A7A7AFF", "#7C7C7CFF", "#7E7E7EFF", "#808080FF", "#828282FF", "#848484FF", "#868686FF", "#888888FF", "#8A8A8AFF", "#8C8C8CFF", "#8D8D8DFF", "#8F8F8FFF", "#919191FF", "#939393FF", "#959595FF", "#979797FF", "#999999FF", "#9A9A9AFF", "#9C9C9CFF", "#9E9E9EFF", "#A0A0A0FF", "#A2A2A2FF", "#A3A3A3FF", "#A5A5A5FF", "#A7A7A7FF", "#A9A9A9FF", "#AAAAAAFF", "#ACACACFF", "#AEAEAEFF", "#AFAFAFFF", "#B1B1B1FF", "#B3B3B3FF", "#B4B4B4FF", "#B6B6B6FF", "#B7B7B7FF", "#B9B9B9FF", "#BBBBBBFF", "#BCBCBCFF", "#BEBEBEFF", "#BFBFBFFF", "#C1C1C1FF", "#C2C2C2FF", "#C3C3C4FF", "#C5C5C5FF", "#C6C6C6FF", "#C8C8C8FF", "#C9C9C9FF", "#CACACAFF", "#CCCCCCFF", "#CDCDCDFF", "#CECECEFF", "#CFCFCFFF", "#D1D1D1FF", "#D2D2D2FF", "#D3D3D3FF", "#D4D4D4FF", "#D5D5D5FF", "#D6D6D6FF", "#D7D7D7FF", "#D8D8D8FF", "#D9D9D9FF", "#DADADAFF", "#DBDBDBFF", "#DCDCDCFF", "#DDDDDDFF", "#DEDEDEFF", "#DEDEDEFF", "#DFDFDFFF", "#E0E0E0FF", "#E0E0E0FF", "#E1E1E1FF", "#E1E1E1FF", "#E2E2E2FF", "#E2E2E2FF", "#E2E2E2FF")
## alpha = 0.5
## cex = 1
## itemLabels = TRUE
## labelCol = #000000B3
## measureLabels = FALSE
## precision = 3
## layout = NULL
## layoutParams = list()
## arrowSize = 0.5
## engine = igraph
## plot = TRUE
## plot_options = list()
## max = 100
## verbose = FALSE
Next, we can find the closed frequent itemsets for further use (these are the itemsets that have no superset with the same support).
trans1.closed<-apriori(trans1, parameter=list(target="closed frequent itemsets",support=0.15))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## NA 0.1 1 none FALSE TRUE 5 0.15 1
## maxlen target ext
## 10 closed frequent itemsets FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 1475
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [4 item(s)] done [0.02s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## filtering closed item sets ... done [0.00s].
## writing ... [4 set(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(trans1.closed)
## items support count
## [1] {soda} 0.1743772 1715
## [2] {rolls/buns} 0.1839349 1809
## [3] {other vegetables} 0.1934926 1903
## [4] {whole milk} 0.2555160 2513
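In the output above only single items pass the 0.15 support threshold, so the "closed" condition is not really exercised. A small made-up example shows what the filter does; the itemsets and counts below are hypothetical and the code is a base R sketch, not the arules implementation.

```r
# Hypothetical frequent itemsets with their support counts
itemsets <- list(c("a"), c("b"), c("c"), c("a", "b"), c("a", "c"))
counts   <- c(4, 3, 2, 3, 2)

# An itemset is closed when no proper superset has the same support count
is_closed <- vapply(seq_along(itemsets), function(i) {
  has_equal_superset <- vapply(seq_along(itemsets), function(j) {
    j != i &&
      length(itemsets[[j]]) > length(itemsets[[i]]) &&
      all(itemsets[[i]] %in% itemsets[[j]]) &&
      counts[j] == counts[i]
  }, logical(1))
  !any(has_equal_superset)
}, logical(1))

closed_labels <- vapply(itemsets[is_closed], paste, character(1), collapse = ",")
print(closed_labels)
```

Here {b} is not closed because {a, b} has the same count, and {c} is not closed because of {a, c}; the remaining three itemsets are closed.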
We can also check the significance of each rule; the result is a logical vector with one entry per rule.
is.significant(rules.yogurt, trans1)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
## [23] TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
## [34] FALSE TRUE TRUE TRUE
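To get a feeling for what such a test looks at, here is a sketch using Fisher's exact test from base R on a 2x2 table for a single rule A => B. I am assuming this is close in spirit to what `is.significant()` does by default, but the exact method, the significance level and any multiple-testing correction depend on the arules version, so please check its documentation. The counts and the 0.01 level are made up.

```r
# Hypothetical counts for a rule A => B over 1000 transactions
both    <- 50   # contain A and B
a_only  <- 150  # contain A but not B
b_only  <- 50   # contain B but not A
neither <- 750  # contain neither

# Rows: B present or not; columns: A present or not (filled column by column)
tab <- matrix(c(both, a_only, b_only, neither), nrow = 2,
              dimnames = list(B = c("yes", "no"), A = c("yes", "no")))

# One-sided test: do A and B occur together more often than expected by chance?
p_value <- fisher.test(tab, alternative = "greater")$p.value
significant <- p_value < 0.01  # assumed significance level
print(significant)
```

With independence we would expect about 20 baskets containing both A and B (200 * 100 / 1000), so observing 50 gives a very small p-value.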
We can also find out which transactions support our rules with the following code chunk. I will not run it here since it produces very long output; instead I paste part of the output below.
supportingTransactions(rules.yogurt, trans1)
inspect(supportingTransactions(rules.yogurt, trans1))
15 {300,302,564,622,744,862,865,917,937,1017,1148,1183,1185,1217,1229,1240,1263,1392,1418,1558,1592,1668,1879,1922,2016,
2445,2460,2502,2510,2527,2710,2948,2974,3006,3085,3149,3157,3611,3726,3895,3940,4072,4091,4105,4313,4336,4343,4375,4400,
4431,4450,4681,4826,4960,4994,5107,5316,5402,5559,5721,5786,5822,5910,6190,6211,6224,6282,6312,6512,6565,6634,6726,6852,
6937,7132,7226,7249,7287,7342,7473,7493,7520,7554,7740,7812,7826,7902,7908,8231,8302,8306,8348,8423,8616,8633,8640,8679,
8743,8750,8823,8836,8848,8883,8909,9063,9088,9140,9170,9236,9319,9355,9531,9617,9669,9778}
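Conceptually, the supporting transactions of a rule are the baskets that contain every item on both sides of the rule. A toy version in base R, with hypothetical baskets and a hypothetical rule, could look like this:

```r
# Hypothetical baskets; the position in the list plays the role of the transaction id
baskets <- list(
  c("butter", "yogurt"),
  c("milk"),
  c("butter", "milk", "yogurt"),
  c("yogurt")
)

# Items on both sides of a hypothetical rule {butter} => {yogurt}
rule_items <- c("butter", "yogurt")

# Ids of the baskets that contain all items of the rule
supporting_ids <- which(vapply(baskets,
                               function(b) all(rule_items %in% b),
                               logical(1)))
print(supporting_ids)
```

The length of this id vector is the count of the rule, which is why the list pasted above is so long for rules with higher support.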