This analysis studies association rules, a machine learning method that looks for “if this, then that” relations between purchased products. In other words, association rules describe relationships between sets of items within each transaction: the “if this” part is called the antecedent and the “then that” part is called the consequent. The study applies market basket analysis, one of the most popular association rule approaches. First, the set of transactions is extracted in order to mine rules; with those rules, the occurrence of an item can be predicted from the occurrences of other items in the same transaction. The rules are evaluated with three metrics: support, confidence and lift.
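To make the three metrics concrete, the sketch below computes them by hand for one rule on a small made-up set of baskets. The baskets, the helper `support` and the rule {bread} => {butter} are illustrative assumptions and are not part of the grocery dataset. It uses the usual definitions: support is the share of baskets containing all items of the rule, confidence is that support divided by the support of the antecedent, and lift is confidence divided by the support of the consequent.

```r
# Toy baskets (hypothetical data, not from the grocery dataset)
baskets <- list(
  c("bread", "butter"),
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("milk"),
  c("bread", "butter")
)

# Share of baskets that contain every item in `items`
support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Rule: {bread} => {butter}
supp_rule <- support(c("bread", "butter"))  # 3 of 5 baskets
conf_rule <- supp_rule / support("bread")   # share of bread baskets that also hold butter
lift_rule <- conf_rule / support("butter")  # confidence relative to butter's base rate

print(c(support = supp_rule, confidence = conf_rule, lift = lift_rule))
```

A lift above 1 means the consequent appears more often alongside the antecedent than its overall frequency would suggest; a lift of exactly 1 means the antecedent carries no information about the consequent.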
The data were collected from a grocery store, and each transaction represents the items purchased in one basket at one time. The data are transactional: there are 7501 transactions and 20 columns, which matches the largest basket size of 20 items.
Here the market basket analysis is carried out step by step. Where an output is extremely long, only its important part is shown.
The libraries shown below are used in this study:
library(arulesViz)
library(arules)
library(dplyr)
The dataset is then imported into R as shown below.
mbo<-read.transactions("Market_Basket_Optimisation.csv", format="basket", sep=",", skip=0)
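For readers unfamiliar with the basket format, the following sketch writes a tiny made-up file and reads it in the same way. The file contents and the name `toy` are hypothetical; each line is one transaction and the items on a line are separated by commas.

```r
library(arules)

# Write three made-up baskets to a temporary file, one basket per line
tmp <- tempfile(fileext = ".csv")
writeLines(c("bread,butter", "milk", "bread,milk,eggs"), tmp)

# Same call shape as for the grocery file
toy <- read.transactions(tmp, format = "basket", sep = ",")

length(toy)      # number of transactions
itemLabels(toy)  # distinct items found in the file
size(toy)        # number of items in each transaction
```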
The details about the dataset before the analysis are shown below.
inspect(head(mbo))
## items
## [1] {almonds,
## antioxydant juice,
## avocado,
## cottage cheese,
## energy drink,
## frozen smoothie,
## green grapes,
## green tea,
## honey,
## low fat yogurt,
## mineral water,
## olive oil,
## salad,
## salmon,
## shrimp,
## spinach,
## tomato juice,
## vegetables mix,
## whole weat flour,
## yams}
## [2] {burgers,
## eggs,
## meatballs}
## [3] {chutney}
## [4] {avocado,
## turkey}
## [5] {energy bar,
## green tea,
## milk,
## mineral water,
## whole wheat rice}
## [6] {low fat yogurt}
summary(mbo)
## transactions as itemMatrix in sparse format with
## 7501 rows (elements/itemsets/transactions) and
## 119 columns (items) and a density of 0.03288973
##
## most frequent items:
## mineral water eggs spaghetti french fries chocolate
## 1788 1348 1306 1282 1229
## (Other)
## 22405
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 1754 1358 1044 816 667 493 391 324 259 139 102 67 40 22 17 4
## 18 19 20
## 1 2 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 3.914 5.000 20.000
##
## includes extended item information - examples:
## labels
## 1 almonds
## 2 antioxydant juice
## 3 asparagus
length(mbo)
## [1] 7501
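The density and the mean basket size reported by summary can be cross-checked from the other figures in the same output. The short calculation below is only a sanity check with hand-copied numbers (the variable names are hypothetical): the total number of purchased items is the sum of the five most frequent item counts and the "(Other)" count.

```r
n_transactions <- 7501
n_items        <- 119

# Counts copied from the "most frequent items" block, including "(Other)"
total_purchases <- 1788 + 1348 + 1306 + 1282 + 1229 + 22405

# Density: share of non-empty cells in the transaction-by-item matrix
density <- total_purchases / (n_transactions * n_items)

# Mean number of items per basket
mean_basket_size <- total_purchases / n_transactions

print(round(density, 8))
print(round(mean_basket_size, 3))
```

Both values agree with the summary output (density 0.03288973, mean 3.914), so on average a basket holds about four of the 119 items.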
At this stage, every item included in the analysis can be inspected together with its number of appearances. According to the output below, mineral water is the most frequent item with 1788 appearances, while water spray is the least frequent with 3.
itemFrequency(mbo, type="absolute")
## almonds antioxydant juice asparagus
## 153 67 36
## avocado babies food bacon
## 250 34 65
## barbecue sauce black tea blueberries
## 81 107 69
## body spray bramble brownies
## 86 14 253
## bug spray burger sauce burgers
## 65 44 654
## butter cake candy bars
## 226 608 73
## carrots cauliflower cereals
## 115 36 193
## champagne chicken chili
## 351 450 46
## chocolate chocolate bread chutney
## 1229 32 31
## cider clothes accessories cookies
## 79 63 603
## cooking oil corn cottage cheese
## 383 36 239
## cream dessert wine eggplant
## 7 33 99
## eggs energy bar energy drink
## 1348 203 200
## escalope extra dark chocolate flax seed
## 595 90 68
## french fries french wine fresh bread
## 1282 169 323
## fresh tuna fromage blanc frozen smoothie
## 167 102 475
## frozen vegetables gluten free bar grated cheese
## 715 52 393
## green beans green grapes green tea
## 65 68 991
## ground beef gums ham
## 737 101 199
## hand protein bar herb & pepper honey
## 39 371 356
## hot dogs ketchup light cream
## 243 33 117
## light mayo low fat yogurt magazines
## 204 574 82
## mashed potato mayonnaise meatballs
## 31 46 157
## melons milk mineral water
## 90 972 1788
## mint mint green tea muffins
## 131 42 181
## mushroom cream sauce napkins nonfat milk
## 143 5 78
## oatmeal oil olive oil
## 33 173 494
## pancakes parmesan cheese pasta
## 713 149 118
## pepper pet food pickles
## 199 49 45
## protein bar red wine rice
## 139 211 141
## salad salmon salt
## 37 319 69
## sandwich shallot shampoo
## 34 58 37
## shrimp soda soup
## 536 47 379
## spaghetti sparkling water spinach
## 1306 47 53
## strawberries strong cheese tea
## 160 58 29
## tomato juice tomato sauce tomatoes
## 228 106 513
## toothpaste turkey vegetables mix
## 61 469 193
## water spray white wine whole weat flour
## 3 124 70
## whole wheat pasta whole wheat rice yams
## 221 439 86
## yogurt cake zucchini
## 205 71
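Rather than scanning the table by eye, the extremes can be picked out programmatically. The sketch below uses a hand-copied subset of the counts above in a plain named vector (the name `counts` is hypothetical); on the real data the same calls could presumably be applied to the result of itemFrequency(mbo, type="absolute"), which is a named numeric vector.

```r
# A few counts copied from the frequency table above
counts <- c("mineral water" = 1788, "eggs" = 1348, "cream" = 7,
            "napkins" = 5, "water spray" = 3)

names(which.max(counts))  # most frequent item
names(which.min(counts))  # least frequent item
head(sort(counts), 3)     # three rarest items, in ascending order
```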
itemFrequencyPlot(mbo, topN=10, type="absolute", main="Item Frequency")
The bar chart above shows the 10 most frequent items in the dataset; mineral water is clearly the most frequent.
At this stage the apriori algorithm is applied to obtain the support, confidence and lift values. The rules are then sorted by each measure.
rules.mbo<-apriori(mbo, parameter=list(supp=0.1, conf=0.1))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.1 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 750
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [7 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [7 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
As can be seen above, the apriori algorithm generates 7 rules. The rules are sorted below in descending order by count, support, confidence and lift in turn.
rules.by.count<- sort(rules.mbo, by="count", decreasing=TRUE)
inspect(rules.by.count)
## lhs rhs support confidence lift count
## [1] {} => {mineral water} 0.2383682 0.2383682 1 1788
## [2] {} => {eggs} 0.1797094 0.1797094 1 1348
## [3] {} => {spaghetti} 0.1741101 0.1741101 1 1306
## [4] {} => {french fries} 0.1709105 0.1709105 1 1282
## [5] {} => {chocolate} 0.1638448 0.1638448 1 1229
## [6] {} => {green tea} 0.1321157 0.1321157 1 991
## [7] {} => {milk} 0.1295827 0.1295827 1 972
rules.by.supp<-sort(rules.mbo, by = "support", decreasing=TRUE)
inspect(rules.by.supp)
## lhs rhs support confidence lift count
## [1] {} => {mineral water} 0.2383682 0.2383682 1 1788
## [2] {} => {eggs} 0.1797094 0.1797094 1 1348
## [3] {} => {spaghetti} 0.1741101 0.1741101 1 1306
## [4] {} => {french fries} 0.1709105 0.1709105 1 1282
## [5] {} => {chocolate} 0.1638448 0.1638448 1 1229
## [6] {} => {green tea} 0.1321157 0.1321157 1 991
## [7] {} => {milk} 0.1295827 0.1295827 1 972
rules.by.conf <- sort(rules.mbo, by = "confidence", decreasing=TRUE)
inspect(rules.by.conf)
## lhs rhs support confidence lift count
## [1] {} => {mineral water} 0.2383682 0.2383682 1 1788
## [2] {} => {eggs} 0.1797094 0.1797094 1 1348
## [3] {} => {spaghetti} 0.1741101 0.1741101 1 1306
## [4] {} => {french fries} 0.1709105 0.1709105 1 1282
## [5] {} => {chocolate} 0.1638448 0.1638448 1 1229
## [6] {} => {green tea} 0.1321157 0.1321157 1 991
## [7] {} => {milk} 0.1295827 0.1295827 1 972
rules.by.lift<-sort(rules.mbo, by = "lift", decreasing=TRUE)
inspect(rules.by.lift)
## lhs rhs support confidence lift count
## [1] {} => {green tea} 0.1321157 0.1321157 1 991
## [2] {} => {french fries} 0.1709105 0.1709105 1 1282
## [3] {} => {chocolate} 0.1638448 0.1638448 1 1229
## [4] {} => {eggs} 0.1797094 0.1797094 1 1348
## [5] {} => {spaghetti} 0.1741101 0.1741101 1 1306
## [6] {} => {mineral water} 0.2383682 0.2383682 1 1788
## [7] {} => {milk} 0.1295827 0.1295827 1 972
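All seven rules above have an empty antecedent: a rule of the form {} => {item} only restates how frequent the item is, which is why confidence equals support and lift is 1 in every row. At a support threshold of 0.1, no combination of two items appears to be frequent enough to form a rule. One possible way to obtain rules with a non-empty antecedent is to set minlen = 2 and lower the thresholds. The sketch below shows the idea on made-up baskets; the data, the name `rules.toy` and the thresholds are assumptions chosen only for this toy example.

```r
library(arules)

# Made-up baskets (not the grocery data)
baskets <- list(
  c("bread", "butter"),
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("milk"),
  c("bread", "butter")
)
toy <- as(baskets, "transactions")

# minlen = 2 requires at least two items per rule (antecedent plus consequent),
# which excludes rules of the form {} => {item}
rules.toy <- apriori(toy,
                     parameter = list(supp = 0.3, conf = 0.6, minlen = 2),
                     control = list(verbose = FALSE))

inspect(sort(rules.toy, by = "lift"))
```

With a non-empty antecedent, lift is no longer fixed at 1, so sorting by lift becomes informative.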
Mineral water ranks first in every table except the one sorted by lift, where all seven rules tie at a lift of 1 and the order is therefore arbitrary. Next, the transactions that lead to mineral water are analyzed.
mbo.sel<-mbo[,itemFrequency(mbo)>0.05]
rules.mw<-apriori(data=mbo, parameter=list(supp=0.001,conf = 0.08),
appearance=list(default="lhs", rhs="mineral water"), control=list(verbose=FALSE))
rules.mw.byconf<-sort(rules.mw, by="confidence", decreasing=TRUE)
inspect(head(rules.mw.byconf))
## lhs rhs support confidence lift count
## [1] {ground beef,
## light cream,
## olive oil} => {mineral water} 0.001199840 1.0000000 4.195190 9
## [2] {cake,
## olive oil,
## shrimp} => {mineral water} 0.001199840 1.0000000 4.195190 9
## [3] {red wine,
## soup} => {mineral water} 0.001866418 0.9333333 3.915511 14
## [4] {ground beef,
## pancakes,
## whole wheat rice} => {mineral water} 0.001333156 0.9090909 3.813809 10
## [5] {frozen vegetables,
## milk,
## spaghetti,
## turkey} => {mineral water} 0.001199840 0.9000000 3.775671 9
## [6] {chocolate,
## frozen vegetables,
## olive oil,
## shrimp} => {mineral water} 0.001199840 0.9000000 3.775671 9
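The figures in the first row can be reproduced with simple arithmetic, which also shows how little data stands behind a confidence of 1. The calculation below uses only numbers from the outputs above; the variable names are hypothetical.

```r
n          <- 7501  # all transactions
count_mw   <- 1788  # baskets containing mineral water
count_rule <- 9     # baskets containing the whole of rule [1]
conf_rule  <- 1.0   # confidence reported for rule [1]

supp_rule <- count_rule / n                # support of the rule
lift_rule <- conf_rule / (count_mw / n)    # confidence over the support of the consequent
count_lhs <- count_rule / conf_rule        # baskets matching the antecedent

print(c(support = supp_rule, lift = lift_rule, antecedent_count = count_lhs))
```

The antecedent {ground beef, light cream, olive oil} occurs in only 9 of 7501 baskets, so the rule, although perfectly confident in this sample, rests on very few observations.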
Now the opposite situation is analyzed: mineral water is the antecedent, and the rules show what else a customer who buys mineral water puts in the basket.
rules.mw<-apriori(data=mbo, parameter=list(supp=0.001,conf = 0.08),
appearance=list(default="rhs",lhs="mineral water"), control=list(verbose=FALSE))
# sorted by support in ascending order, so head() shows the lowest-support rules
rules.mw.bysupp<-sort(rules.mw, by="support", decreasing=FALSE)
inspect(head(rules.mw.bysupp))
## lhs rhs support confidence lift count
## [1] {mineral water} => {turkey} 0.01919744 0.08053691 1.288075 144
## [2] {mineral water} => {cooking oil} 0.02013065 0.08445190 1.653978 151
## [3] {mineral water} => {whole wheat rice} 0.02013065 0.08445190 1.442993 151
## [4] {mineral water} => {frozen smoothie} 0.02026396 0.08501119 1.342461 152
## [5] {mineral water} => {chicken} 0.02279696 0.09563758 1.594172 171
## [6] {mineral water} => {soup} 0.02306359 0.09675615 1.914955 173
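Because the rules are sorted by support in ascending order, the output lists the six rules with the lowest support among those that meet the thresholds. A basket containing mineral water also contains turkey, cooking oil, whole wheat rice, frozen smoothie, chicken or soup in roughly 8% to 10% of cases, and every one of these rules has a lift above 1.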
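Sorting by lift in descending order is one way to bring the strongest associations to the top instead. The sketch below repeats the fixed-antecedent setup on made-up baskets; the data, the item "bread", the name `rules.bread` and the thresholds are assumptions for illustration only.

```r
library(arules)

# Made-up baskets (not the grocery data)
baskets <- list(
  c("bread", "butter"),
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("milk"),
  c("bread", "butter")
)
toy <- as(baskets, "transactions")

# "bread" may appear only in the antecedent, every other item only in the consequent;
# minlen = 2 drops rules with an empty antecedent
rules.bread <- apriori(toy,
                       parameter = list(supp = 0.3, conf = 0.4, minlen = 2),
                       appearance = list(default = "rhs", lhs = "bread"),
                       control = list(verbose = FALSE))

# Strongest association first
rules.bread.bylift <- sort(rules.bread, by = "lift", decreasing = TRUE)
inspect(rules.bread.bylift)
```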
Here the rules mined from the market basket optimization dataset are plotted.
plot(rules.mbo) #graph belonging to rules
plot(rules.mbo, measure=c("support","lift"), shading="confidence")
plot(rules.mbo, shading="order", control=list(main="Two-key plot"))
plot(rules.mbo, method="graph")
plot(rules.mbo, method="graph", control=list(type="items"))
The last call prints the list of available control parameters, which suggests that type is not a control parameter recognised by this plot method.
## Available control parameters (with default values):
## main = Graph for 7 rules
## nodeColors = c("#66CC6680", "#9999CC80")
## nodeCol = c("#EE0000FF", "#EE0303FF", "#EE0606FF", ..., "#EEECECFF", "#EEEEEEFF")
## edgeCol = c("#474747FF", "#494949FF", "#4B4B4BFF", ..., "#E2E2E2FF", "#E2E2E2FF")
## alpha = 0.5
## cex = 1
## itemLabels = TRUE
## labelCol = #000000B3
## measureLabels = FALSE
## precision = 3
## layout = NULL
## layoutParams = list()
## arrowSize = 0.5
## engine = igraph
## plot = TRUE
## plot_options = list()
## max = 100
## verbose = FALSE
Next, closed frequent itemsets are mined with the apriori algorithm.
mbo.closed<-apriori(mbo, parameter=list(target="closed frequent itemsets",support=0.15))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## NA 0.1 1 none FALSE TRUE 5 0.15 1
## maxlen target ext
## 10 closed frequent itemsets FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 1125
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [5 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## filtering closed item sets ... done [0.00s].
## writing ... [5 set(s)] done [0.00s].
## creating S4 object ... done [0.00s].
mbo.closed
## set of 5 itemsets
inspect(mbo.closed)
## items support count
## [1] {french fries} 0.1709105 1282
## [2] {chocolate} 0.1638448 1229
## [3] {eggs} 0.1797094 1348
## [4] {spaghetti} 0.1741101 1306
## [5] {mineral water} 0.2383682 1788
class(mbo.closed)
## [1] "itemsets"
## attr(,"package")
## [1] "arules"
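The distinction between a frequent itemset and a closed one can be sketched in base R. An itemset is closed when no proper superset occurs in exactly the same number of transactions. The baskets and the helpers `count_of` and `is_closed` below are made up for illustration and do not reproduce how arules implements the search.

```r
# Made-up baskets (not the grocery data)
baskets <- list(
  c("bread", "butter"),
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("milk"),
  c("bread", "butter")
)
all_items <- sort(unique(unlist(baskets)))

# Number of baskets that contain every item in `items`
count_of <- function(items) {
  sum(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Closed: adding any single further item strictly lowers the count.
# Checking single-item extensions is enough because counts can only
# fall or stay equal as an itemset grows.
is_closed <- function(items) {
  others <- setdiff(all_items, items)
  all(vapply(others, function(x) count_of(c(items, x)) < count_of(items), logical(1)))
}

is_closed("bread")   # bread occurs in 4 baskets, every extension in fewer
is_closed("butter")  # butter and {bread, butter} both occur in 3 baskets
```

In the toy data {butter} is not closed because every basket with butter also has bread, so {bread, butter} carries the same information.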
The significance of each rule is checked with is.significant. None of the seven rules is significant.
is.significant(rules.mbo, mbo)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
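To see what a significant rule would look like, one of the rules with a non-empty antecedent can be tested by hand. The sketch below runs a one-sided Fisher's exact test with base R's fisher.test on the 2x2 table for {mineral water} => {soup}, using counts reported earlier (173 baskets with both items, 1788 with mineral water, 379 with soup, 7501 in total). That is.significant performs the same kind of test by default is an assumption here, and the 0.01 threshold is chosen only for illustration.

```r
n     <- 7501  # all transactions
both  <- 173   # baskets with mineral water and soup
water <- 1788  # baskets with mineral water
soup  <- 379   # baskets with soup

# Rows: mineral water yes/no; columns: soup yes/no
tab <- matrix(c(both,        water - both,
                soup - both, n - water - soup + both),
              nrow = 2, byrow = TRUE,
              dimnames = list(water = c("yes", "no"), soup = c("yes", "no")))

# One-sided test: do the two items co-occur more often than under independence?
test <- fisher.test(tab, alternative = "greater")
test$p.value < 0.01
```

Under independence about 1788 * 379 / 7501, roughly 90, baskets would contain both items, compared with the 173 observed.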
is.maximal checks which rules are maximal, that is, which have no proper superset among the other rules; all seven are.
is.maximal(rules.mbo)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
is.redundant(rules.mbo) #finding redundant rules
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
inspect(rules.mbo[is.redundant(rules.mbo)==FALSE])
## lhs rhs support confidence lift count
## [1] {} => {green tea} 0.1321157 0.1321157 1 991
## [2] {} => {milk} 0.1295827 0.1295827 1 972
## [3] {} => {french fries} 0.1709105 0.1709105 1 1282
## [4] {} => {chocolate} 0.1638448 0.1638448 1 1229
## [5] {} => {eggs} 0.1797094 0.1797094 1 1348
## [6] {} => {spaghetti} 0.1741101 0.1741101 1 1306
## [7] {} => {mineral water} 0.2383682 0.2383682 1 1788
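At this stage, the superset and subset relations between the rules are shown below. Each rule is a superset and a subset only of itself, so both matrices are diagonal.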
is.superset(rules.mbo) #finds supersets
## 7 x 7 sparse Matrix of class "ngCMatrix"
## {green tea} {milk} {french fries} {chocolate} {eggs}
## {green tea} | . . . .
## {milk} . | . . .
## {french fries} . . | . .
## {chocolate} . . . | .
## {eggs} . . . . |
## {spaghetti} . . . . .
## {mineral water} . . . . .
## {spaghetti} {mineral water}
## {green tea} . .
## {milk} . .
## {french fries} . .
## {chocolate} . .
## {eggs} . .
## {spaghetti} | .
## {mineral water} . |
is.subset(rules.mbo) # finds subsets
## 7 x 7 sparse Matrix of class "ngCMatrix"
## {green tea} {milk} {french fries} {chocolate} {eggs}
## {green tea} | . . . .
## {milk} . | . . .
## {french fries} . . | . .
## {chocolate} . . . | .
## {eggs} . . . . |
## {spaghetti} . . . . .
## {mineral water} . . . . .
## {spaghetti} {mineral water}
## {green tea} . .
## {milk} . .
## {french fries} . .
## {chocolate} . .
## {eggs} . .
## {spaghetti} | .
## {mineral water} . |
supportingTransactions(rules.mbo, mbo)
## tidLists in sparse format with
## 7 items/itemsets (rows) and
## 7501 transactions (columns)
In conclusion, this analysis studied market basket analysis, one of the most popular association rule approaches. The “market basket optimization” dataset was analyzed, with the following results. There are 7501 transactions and 119 different items. The necessary information about the dataset was obtained with the summary method. The “arules” and “arulesViz” packages were the main tools. The set of transactions was determined and rules were mined from it; 7 rules were obtained, together with their support, confidence and lift, and the rules were sorted by each measure. According to the results, mineral water is the most frequent item. The transactions that lead to mineral water, and the opposite situation, were then analyzed. The results were plotted and the rules were tested. Lastly, subsets and supersets were obtained.