Association rule mining aims to discover frequently occurring patterns, correlations, or associations in datasets. It analyzes data for frequent if/then patterns to uncover underlying relationships. These relationships are associative, not causal. Market Basket Analysis is the classic application of this type: items sold or purchased together are analyzed to find associations between them. The resulting conclusions can be very useful for business decisions.
In this project I analyze the associations between different grocery items, using the dataset found at this link: (https://www.kaggle.com/roshansharma/market-basket-optimization)
requiredPackages <- c("arules", "arulesViz", "arulesCBA")
# load each package, installing it first if it is not available
for (i in requiredPackages) {
  if (!require(i, character.only = TRUE)) {
    install.packages(i)
    library(i, character.only = TRUE)
  }
}
Marketdata = read.transactions('market.csv', header = T,sep = ",")
## Warning in asMethod(object): removing duplicated items in transactions
Since the data is large, some less frequent items will be ignored.
dim(Marketdata) # number of transactions x number of items
## [1] 7500 119
ave_basket_size = mean(size(Marketdata));ave_basket_size # average number of items in a transaction
## [1] 3.911733
Summary of the dataset.
summary(Marketdata)
## transactions as itemMatrix in sparse format with
## 7500 rows (elements/itemsets/transactions) and
## 119 columns (items) and a density of 0.03287171
##
## most frequent items:
## mineral water eggs spaghetti french fries chocolate
## 1787 1348 1306 1282 1229
## (Other)
## 22386
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 1754 1358 1044 816 667 493 391 324 259 139 102 67 40 22 17 4
## 18 19
## 1 2
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 3.912 5.000 19.000
##
## includes extended item information - examples:
## labels
## 1 almonds
## 2 antioxydant juice
## 3 asparagus
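As a quick cross-check, the average basket size and the density reported by summary() can be recomputed from the item counts in the same output. This is a minimal base R sketch; the variable names are my own, and the numbers are copied from the output above.

```r
# Values copied from the summary() output above.
n_transactions <- 7500
n_items <- 119
top5_counts <- c(1787, 1348, 1306, 1282, 1229)  # mineral water ... chocolate
other_count <- 22386                            # the "(Other)" entry

# Total number of item occurrences (non-zero cells of the sparse matrix).
total_occurrences <- sum(top5_counts) + other_count

# Average basket size: item occurrences per transaction.
avg_basket_size <- total_occurrences / n_transactions

# Density: share of non-zero cells in the transactions x items matrix.
density <- total_occurrences / (n_transactions * n_items)

print(round(avg_basket_size, 6))
print(round(density, 8))
```

Both values should agree with the mean basket size (3.911733) and the density (0.03287171) shown above.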
This is one way to see the most/least frequent items for further steps.
round(itemFrequency(Marketdata, type="relative"),3) # relative frequency rounded to 3 digits
## almonds antioxydant juice asparagus
## 0.020 0.009 0.005
## avocado babies food bacon
## 0.033 0.005 0.009
## barbecue sauce black tea blueberries
## 0.011 0.014 0.009
## body spray bramble brownies
## 0.011 0.002 0.034
## bug spray burger sauce burgers
## 0.009 0.006 0.087
## butter cake candy bars
## 0.030 0.081 0.010
## carrots cauliflower cereals
## 0.015 0.005 0.026
## champagne chicken chili
## 0.047 0.060 0.006
## chocolate chocolate bread chutney
## 0.164 0.004 0.004
## cider clothes accessories cookies
## 0.011 0.008 0.080
## cooking oil corn cottage cheese
## 0.051 0.005 0.032
## cream dessert wine eggplant
## 0.001 0.004 0.013
## eggs energy bar energy drink
## 0.180 0.027 0.027
## escalope extra dark chocolate flax seed
## 0.079 0.012 0.009
## french fries french wine fresh bread
## 0.171 0.023 0.043
## fresh tuna fromage blanc frozen smoothie
## 0.022 0.014 0.063
## frozen vegetables gluten free bar grated cheese
## 0.095 0.007 0.052
## green beans green grapes green tea
## 0.009 0.009 0.132
## ground beef gums ham
## 0.098 0.013 0.027
## hand protein bar herb & pepper honey
## 0.005 0.049 0.047
## hot dogs ketchup light cream
## 0.032 0.004 0.016
## light mayo low fat yogurt magazines
## 0.027 0.076 0.011
## mashed potato mayonnaise meatballs
## 0.004 0.006 0.021
## melons milk mineral water
## 0.012 0.130 0.238
## mint mint green tea muffins
## 0.017 0.006 0.024
## mushroom cream sauce napkins nonfat milk
## 0.019 0.001 0.010
## oatmeal oil olive oil
## 0.004 0.023 0.066
## pancakes parmesan cheese pasta
## 0.095 0.020 0.016
## pepper pet food pickles
## 0.027 0.007 0.006
## protein bar red wine rice
## 0.019 0.028 0.019
## salad salmon salt
## 0.005 0.042 0.009
## sandwich shallot shampoo
## 0.005 0.008 0.005
## shrimp soda soup
## 0.071 0.006 0.051
## spaghetti sparkling water spinach
## 0.174 0.006 0.007
## strawberries strong cheese tea
## 0.021 0.008 0.004
## tomato juice tomato sauce tomatoes
## 0.030 0.014 0.068
## toothpaste turkey vegetables mix
## 0.008 0.063 0.026
## water spray white wine whole weat flour
## 0.000 0.017 0.009
## whole wheat pasta whole wheat rice yams
## 0.029 0.059 0.011
## yogurt cake zucchini
## 0.027 0.009
itemFrequency(Marketdata, type="absolute")
## almonds antioxydant juice asparagus
## 152 66 36
## avocado babies food bacon
## 249 34 65
## barbecue sauce black tea blueberries
## 81 107 69
## body spray bramble brownies
## 86 14 253
## bug spray burger sauce burgers
## 65 44 654
## butter cake candy bars
## 226 608 73
## carrots cauliflower cereals
## 115 36 193
## champagne chicken chili
## 351 450 46
## chocolate chocolate bread chutney
## 1229 32 31
## cider clothes accessories cookies
## 79 63 603
## cooking oil corn cottage cheese
## 383 36 238
## cream dessert wine eggplant
## 7 33 99
## eggs energy bar energy drink
## 1348 203 199
## escalope extra dark chocolate flax seed
## 595 90 68
## french fries french wine fresh bread
## 1282 169 323
## fresh tuna fromage blanc frozen smoothie
## 167 102 474
## frozen vegetables gluten free bar grated cheese
## 715 52 393
## green beans green grapes green tea
## 65 67 990
## ground beef gums ham
## 737 101 199
## hand protein bar herb & pepper honey
## 39 371 355
## hot dogs ketchup light cream
## 243 33 117
## light mayo low fat yogurt magazines
## 204 573 82
## mashed potato mayonnaise meatballs
## 31 46 157
## melons milk mineral water
## 90 972 1787
## mint mint green tea muffins
## 131 42 181
## mushroom cream sauce napkins nonfat milk
## 143 5 78
## oatmeal oil olive oil
## 33 173 493
## pancakes parmesan cheese pasta
## 713 149 118
## pepper pet food pickles
## 199 49 45
## protein bar red wine rice
## 139 211 141
## salad salmon salt
## 36 318 69
## sandwich shallot shampoo
## 34 58 37
## shrimp soda soup
## 535 47 379
## spaghetti sparkling water spinach
## 1306 47 52
## strawberries strong cheese tea
## 160 58 29
## tomato juice tomato sauce tomatoes
## 227 106 513
## toothpaste turkey vegetables mix
## 61 469 192
## water spray white wine whole weat flour
## 3 124 69
## whole wheat pasta whole wheat rice yams
## 221 439 85
## yogurt cake zucchini
## 205 71
This is how the most frequent items compare to each other, first in absolute and then in relative terms.
itemFrequencyPlot(Marketdata, topN=10, type="absolute", main="Item Frequency")
itemFrequencyPlot(Marketdata, topN=10, type="relative", main="Item Frequency")
itemFrequencyPlot(Marketdata, support = 0.1) #minimum support at 10%
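To see which items the 10% support filter keeps without reading the plot, the same cut can be applied to the counts by hand. The sketch below uses base R only and the ten largest absolute counts copied from the itemFrequency() output above; the variable names are my own.

```r
n_transactions <- 7500

# Ten largest absolute counts, copied from the itemFrequency() output above.
counts <- c("mineral water" = 1787, eggs = 1348, spaghetti = 1306,
            "french fries" = 1282, chocolate = 1229, "green tea" = 990,
            milk = 972, "ground beef" = 737, "frozen vegetables" = 715,
            pancakes = 713)

# Relative frequency (support) of each single item.
rel_freq <- counts / n_transactions

# Items that reach the 10% minimum support.
frequent <- names(rel_freq)[rel_freq >= 0.1]
print(frequent)
```

With 7500 transactions, 10% support corresponds to 750 baskets, so ground beef (737) falls just below the cut.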
Apriori is an algorithm for frequent itemset mining and association rule learning. It identifies the frequent individual items in the database and extends them to larger and larger itemsets as long as those itemsets appear sufficiently often. The frequent itemsets are then used to generate association rules. It is commonly applied in market basket analyses like this one. The apriori() function of the ‘arules’ package in R implements it.
To simplify the analysis, I consider only rules with at least the average support and confidence: mean confidence = 0.3443, mean support = 0.009624. These two values are passed to apriori() as thresholds; the call does not set a lift threshold.
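The level-wise idea can be sketched in a few lines of base R. This is a simplified illustration on made-up toy baskets, not the arules implementation: it joins frequent itemsets pairwise and omits the subset-pruning step of the full algorithm. All names and data here are hypothetical.

```r
# Toy transactions (made up for illustration, not from the grocery dataset).
baskets <- list(
  c("bread", "milk"),
  c("bread", "eggs", "milk"),
  c("eggs", "milk"),
  c("bread", "eggs"),
  c("bread", "eggs", "milk")
)
min_count <- 3  # minimum support of 60% of the 5 baskets

# Number of baskets that contain every item of an itemset.
support_count <- function(itemset) {
  sum(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

# Level 1: frequent single items.
items <- sort(unique(unlist(baskets)))
frequent <- Filter(function(s) support_count(s) >= min_count, as.list(items))
all_frequent <- frequent

# Level k: join frequent (k-1)-itemsets and keep the frequent candidates.
k <- 2
while (length(frequent) > 1) {
  candidates <- unique(lapply(
    combn(seq_along(frequent), 2, simplify = FALSE),
    function(idx) sort(union(frequent[[idx[1]]], frequent[[idx[2]]]))
  ))
  candidates <- Filter(function(s) length(s) == k, candidates)
  frequent <- Filter(function(s) support_count(s) >= min_count, candidates)
  all_frequent <- c(all_frequent, frequent)
  k <- k + 1
}

for (s in all_frequent) {
  cat(sprintf("{%s} support = %.1f\n",
              paste(s, collapse = ", "), support_count(s) / length(baskets)))
}
```

On this toy data all single items and all pairs are frequent, while the triple {bread, eggs, milk} appears in only two baskets and is dropped, which ends the search.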
# generating rules
rules <- apriori(Marketdata, parameter = list(support = 0.009624, confidence = 0.3443, minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.3443 0.1 1 none FALSE TRUE 5 0.009624 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 72
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7500 transaction(s)] done [0.00s].
## sorting and recoding items ... [76 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [37 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
table(size(rules)) # how many rules in each size category?
##
## 2 3
## 18 19
length(rules) # how many rules
## [1] 37
summary(rules)
## set of 37 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 18 19
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 3.000 2.514 3.000 3.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.009733 Min. :0.3450 Min. :0.02000 Min. :1.448
## 1st Qu.:0.010933 1st Qu.:0.3738 1st Qu.:0.02653 1st Qu.:1.639
## Median :0.014000 Median :0.4048 Median :0.03320 Median :1.781
## Mean :0.017906 Mean :0.4090 Mean :0.04497 Mean :1.830
## 3rd Qu.:0.022800 3rd Qu.:0.4357 3rd Qu.:0.05107 3rd Qu.:1.989
## Max. :0.048000 Max. :0.5067 Max. :0.12960 Max. :2.541
## count
## Min. : 73.0
## 1st Qu.: 82.0
## Median :105.0
## Mean :134.3
## 3rd Qu.:171.0
## Max. :360.0
##
## mining info:
## data ntransactions support confidence
## Marketdata 7500 0.009624 0.3443
inspect(rules[1:10])
## lhs rhs support confidence coverage
## [1] {pepper} => {spaghetti} 0.009866667 0.3718593 0.02653333
## [2] {cereals} => {mineral water} 0.010266667 0.3989637 0.02573333
## [3] {red wine} => {spaghetti} 0.010266667 0.3649289 0.02813333
## [4] {red wine} => {mineral water} 0.010933333 0.3886256 0.02813333
## [5] {avocado} => {mineral water} 0.011466667 0.3453815 0.03320000
## [6] {salmon} => {mineral water} 0.016933333 0.3993711 0.04240000
## [7] {herb & pepper} => {mineral water} 0.017066667 0.3450135 0.04946667
## [8] {soup} => {mineral water} 0.023066667 0.4564644 0.05053333
## [9] {cooking oil} => {mineral water} 0.020133333 0.3942559 0.05106667
## [10] {chicken} => {mineral water} 0.022800000 0.3800000 0.06000000
## lift count
## [1] 2.135486 74
## [2] 1.674442 77
## [3] 2.095687 77
## [4] 1.631053 82
## [5] 1.449559 86
## [6] 1.676152 127
## [7] 1.448014 128
## [8] 1.915771 173
## [9] 1.654683 151
## [10] 1.594852 171
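The quality measures in this table can be reproduced by hand from the item counts. The sketch below recomputes them for rule [8], {soup} => {mineral water}, in base R, using counts copied from the outputs above; the variable names are my own.

```r
n <- 7500            # number of transactions, from dim(Marketdata)
count_soup <- 379    # absolute frequency of soup
count_water <- 1787  # absolute frequency of mineral water
count_both <- 173    # "count" column of rule [8]

# support: share of all baskets containing both sides of the rule
support <- count_both / n
# confidence: share of soup baskets that also contain mineral water
confidence <- count_both / count_soup
# coverage: support of the left-hand side alone
coverage <- count_soup / n
# lift: confidence relative to the baseline frequency of mineral water
lift <- confidence / (count_water / n)

print(round(c(support = support, confidence = confidence,
              coverage = coverage, lift = lift), 6))
```

A lift above 1 means that baskets with soup contain mineral water more often than baskets in general do.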
For clearer plotting, I have chosen the 15 rules with the highest confidence from the rules above. Using multiple plot types can be beneficial.
rules_chosen = sort(rules, by = "confidence")[1:15]
inspect(rules_chosen)
## lhs rhs support confidence
## [1] {eggs,ground beef} => {mineral water} 0.010133333 0.5066667
## [2] {ground beef,milk} => {mineral water} 0.011066667 0.5030303
## [3] {chocolate,ground beef} => {mineral water} 0.010933333 0.4739884
## [4] {frozen vegetables,milk} => {mineral water} 0.011066667 0.4689266
## [5] {soup} => {mineral water} 0.023066667 0.4564644
## [6] {pancakes,spaghetti} => {mineral water} 0.011466667 0.4550265
## [7] {olive oil,spaghetti} => {mineral water} 0.010266667 0.4476744
## [8] {milk,spaghetti} => {mineral water} 0.015733333 0.4436090
## [9] {ground beef,milk} => {spaghetti} 0.009733333 0.4424242
## [10] {chocolate,milk} => {mineral water} 0.014000000 0.4356846
## [11] {ground beef,spaghetti} => {mineral water} 0.017066667 0.4353741
## [12] {frozen vegetables,spaghetti} => {mineral water} 0.012000000 0.4306220
## [13] {chocolate,frozen vegetables} => {mineral water} 0.009733333 0.4244186
## [14] {eggs,milk} => {mineral water} 0.013066667 0.4242424
## [15] {olive oil} => {mineral water} 0.027466667 0.4178499
## coverage lift count
## [1] 0.02000000 2.126469 76
## [2] 0.02200000 2.111207 83
## [3] 0.02306667 1.989319 82
## [4] 0.02360000 1.968075 83
## [5] 0.05053333 1.915771 173
## [6] 0.02520000 1.909736 86
## [7] 0.02293333 1.878880 77
## [8] 0.03546667 1.861817 118
## [9] 0.02200000 2.540721 73
## [10] 0.03213333 1.828559 105
## [11] 0.03920000 1.827256 128
## [12] 0.02786667 1.807311 90
## [13] 0.02293333 1.781276 73
## [14] 0.03080000 1.780536 98
## [15] 0.06573333 1.753707 206
plot(rules_chosen, method="matrix", measure="lift")
## Itemsets in Antecedent (LHS)
## [1] "{ground beef,milk}" "{eggs,ground beef}"
## [3] "{chocolate,ground beef}" "{frozen vegetables,milk}"
## [5] "{soup}" "{pancakes,spaghetti}"
## [7] "{olive oil,spaghetti}" "{milk,spaghetti}"
## [9] "{chocolate,milk}" "{ground beef,spaghetti}"
## [11] "{frozen vegetables,spaghetti}" "{chocolate,frozen vegetables}"
## [13] "{eggs,milk}" "{olive oil}"
## Itemsets in Consequent (RHS)
## [1] "{mineral water}" "{spaghetti}"
plot(rules_chosen, measure=c("support","lift"), shading="confidence")
plot(rules_chosen, method="grouped")
#### Graph
plot(rules_chosen, method="graph") # the unsupported control parameter "type" was dropped; it was ignored with a warning
#### Paracoord
plot(rules_chosen, method="paracoord", control=list(reorder=TRUE))
I used confidence as the ranking criterion. However, support, count or lift can also be used. The following line of code reorders the rules so that we can inspect the rule with the highest confidence.
# reorder the rules by confidence and inspect the top one
inspect(sort(rules, by = "confidence")[1]) # support, count or lift can also be used
## lhs rhs support confidence coverage
## [1] {eggs,ground beef} => {mineral water} 0.01013333 0.5066667 0.02
## lift count
## [1] 2.126469 76
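The choice of measure changes which rule comes out on top. As a small illustration in base R, the sketch below ranks four of the rules listed earlier by each measure; the data frame is my own construction with values copied from the inspect() output. With the rules object itself, the equivalent would be sort(rules, by = "lift").

```r
# Four rules copied from the inspect(rules_chosen) output above.
rules_df <- data.frame(
  rule = c("{eggs,ground beef} => {mineral water}",
           "{ground beef,milk} => {mineral water}",
           "{soup} => {mineral water}",
           "{ground beef,milk} => {spaghetti}"),
  support = c(0.010133333, 0.011066667, 0.023066667, 0.009733333),
  confidence = c(0.5066667, 0.5030303, 0.4564644, 0.4424242),
  lift = c(2.126469, 2.111207, 1.915771, 2.540721),
  stringsAsFactors = FALSE
)

# Top rule under each ranking criterion.
top_by_confidence <- rules_df$rule[which.max(rules_df$confidence)]
top_by_lift <- rules_df$rule[which.max(rules_df$lift)]
top_by_support <- rules_df$rule[which.max(rules_df$support)]

print(top_by_confidence)
print(top_by_lift)
print(top_by_support)
```

Here confidence favors {eggs,ground beef} => {mineral water}, lift favors {ground beef,milk} => {spaghetti}, and support favors {soup} => {mineral water}.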
Since this rule involves mineral water, which is also the most frequent item, I decided to see how it is associated with the other items in the basket.
# rules with mineral water on the right-hand side; any other item may appear on the left
rules.rootveg <- apriori(data = Marketdata, parameter = list(supp = 0.01, conf = 0.005),
                         appearance = list(default = "lhs", rhs = "mineral water"),
                         control = list(verbose = FALSE))
rules.rootveg.byconf <- sort(rules.rootveg, by = "confidence", decreasing = TRUE)
inspect(head(rules.rootveg.byconf))
## lhs rhs support confidence
## [1] {eggs,ground beef} => {mineral water} 0.01013333 0.5066667
## [2] {ground beef,milk} => {mineral water} 0.01106667 0.5030303
## [3] {chocolate,ground beef} => {mineral water} 0.01093333 0.4739884
## [4] {frozen vegetables,milk} => {mineral water} 0.01106667 0.4689266
## [5] {soup} => {mineral water} 0.02306667 0.4564644
## [6] {pancakes,spaghetti} => {mineral water} 0.01146667 0.4550265
## coverage lift count
## [1] 0.02000000 2.126469 76
## [2] 0.02200000 2.111207 83
## [3] 0.02306667 1.989319 82
## [4] 0.02360000 1.968075 83
## [5] 0.05053333 1.915771 173
## [6] 0.02520000 1.909736 86
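# the opposite direction: mineral water on the left-hand side, any other item on the right
# (minlen is left at its default, so rules with an empty lhs are also returned)
rules.rootvegopp <- apriori(data = Marketdata, parameter = list(supp = 0.01, conf = 0.005),
                            appearance = list(default = "rhs", lhs = "mineral water"),
                            control = list(verbose = FALSE))
rules.rootvegopp.byconf <- sort(rules.rootvegopp, by = "confidence", decreasing = TRUE)
inspect(head(rules.rootvegopp.byconf))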
## lhs rhs support confidence coverage lift
## [1] {mineral water} => {spaghetti} 0.05973333 0.2506995 0.2382667 1.439698
## [2] {mineral water} => {chocolate} 0.05266667 0.2210409 0.2382667 1.348907
## [3] {mineral water} => {eggs} 0.05093333 0.2137661 0.2382667 1.189351
## [4] {mineral water} => {milk} 0.04800000 0.2014550 0.2382667 1.554436
## [5] {} => {eggs} 0.17973333 0.1797333 1.0000000 1.000000
## [6] {} => {spaghetti} 0.17413333 0.1741333 1.0000000 1.000000
## count
## [1] 448
## [2] 395
## [3] 382
## [4] 360
## [5] 1348
## [6] 1306
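The direction of a rule matters for confidence but not for lift. The base R sketch below illustrates this for mineral water and milk, using the counts from the outputs above (360 baskets contain both items); the variable names are my own.

```r
n <- 7500
count_water <- 1787  # absolute frequency of mineral water
count_milk <- 972    # absolute frequency of milk
count_both <- 360    # "count" of {mineral water} => {milk}

# Confidence depends on which item is on the left-hand side.
conf_water_to_milk <- count_both / count_water
conf_milk_to_water <- count_both / count_milk

# Lift divides confidence by the baseline frequency of the right-hand side,
# which makes it the same in both directions.
lift_water_to_milk <- conf_water_to_milk / (count_milk / n)
lift_milk_to_water <- conf_milk_to_water / (count_water / n)

print(round(c(conf_water_to_milk, conf_milk_to_water), 6))
print(round(c(lift_water_to_milk, lift_milk_to_water), 6))
```

The same arithmetic explains rows [5] and [6] above: with an empty left-hand side, confidence equals the plain frequency of the item and lift is 1.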
I was able to analyze the association rules between different items, focusing on the most frequent ones. The same procedure can be applied to other datasets wherever the underlying relationships are of interest. I found this project very useful and an encouragement to learn more.