“Affinity analysis is a data analysis and data mining technique that discovers co-occurrence relationships among activities performed by (or recorded about) specific individuals or groups. In general, this can be applied to any process where agents can be uniquely identified and information about their activities can be recorded. In retail, affinity analysis is used to perform market basket analysis, in which retailers seek to understand the purchase behavior of customers. This information can then be used for purposes of cross-selling and up-selling, in addition to influencing sales promotions, loyalty programs, store design, and discount plans.” (https://en.wikipedia.org/wiki/Affinity_analysis)
Market basket analysis helps to explain what drives people to buy certain items together. However, association rules are implemented relatively poorly in R. In particular, visualisation of rules is conventionally limited to the package “arulesViz”. In this paper I propose another approach based on D3, an interactive visualisation library written in JavaScript. I use the package “networkD3”, which makes this library available in R. I want to check the rules found by the apriori algorithm in two cases: a general one and an avocado-oriented one (avocado being the very supreme vegan food product). The dataset comes from Kaggle and contains 7501 transactions. (https://www.kaggle.com/roshansharma/market-basket-optimization)
“The Apriori algorithm was proposed by Agrawal and Srikant in 1994. Apriori is designed to operate on databases containing transactions (for example, collections of items bought by customers, or details of a website frequentation or IP addresses). Other algorithms are designed for finding association rules in data having no transactions (Winepi and Minepi), or having no timestamps (DNA sequencing). Each transaction is seen as a set of items (an itemset). Given a threshold C, the Apriori algorithm identifies the item sets which are subsets of at least C transactions in the database. Apriori uses a “bottom up” approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found. Apriori uses breadth-first search and a Hash tree structure to count candidate item sets efficiently. It generates candidate item sets of length k from item sets of length k-1. Then it prunes the candidates which have an infrequent sub pattern. According to the downward closure lemma, the candidate set contains all frequent k-length item sets. After that, it scans the transaction database to determine frequent item sets among the candidates.” (https://en.wikipedia.org/wiki/Apriori_algorithm)
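The level-wise search described in the quote can be sketched in a few lines of base R. This is only an illustration under simplifying assumptions: the toy transactions, the helper names `support_count()` and `apriori_sketch()` and the threshold are all made up for this example, and candidates are counted with a plain scan instead of the hash tree used by real implementations such as the one in “arules”.

```r
# Toy transactions (hypothetical data, not the Kaggle basket file)
transactions <- list(
  c("mineral water", "spaghetti", "ground beef"),
  c("mineral water", "spaghetti"),
  c("mineral water", "eggs"),
  c("spaghetti", "ground beef"),
  c("mineral water", "spaghetti", "ground beef", "eggs")
)

# Number of transactions that contain every item of `itemset`
support_count <- function(itemset, transactions) {
  sum(vapply(transactions, function(t) all(itemset %in% t), logical(1)))
}

# Breadth-first search for itemsets present in at least `min_count` transactions
apriori_sketch <- function(transactions, min_count) {
  items <- sort(unique(unlist(transactions)))
  # Level 1: frequent single items
  current <- Filter(function(s) support_count(s, transactions) >= min_count,
                    as.list(items))
  frequent <- list()
  k <- 1
  while (length(current) > 0) {
    frequent <- c(frequent, current)
    k <- k + 1
    # Candidate generation: extend each frequent (k-1)-itemset by one
    # frequent item that sorts after its last element (avoids duplicates)
    freq_items <- sort(unique(unlist(current)))
    candidates <- list()
    for (s in current) {
      for (i in freq_items[freq_items > s[length(s)]]) {
        candidates <- c(candidates, list(c(s, i)))
      }
    }
    # Pruning (downward closure): every (k-1)-subset must itself be frequent
    keys <- vapply(current, paste, character(1), collapse = "|")
    candidates <- Filter(function(cand) {
      subsets <- combn(cand, k - 1, simplify = FALSE)
      all(vapply(subsets, paste, character(1), collapse = "|") %in% keys)
    }, candidates)
    # Counting step: keep only candidates that reach the threshold
    current <- Filter(function(s) support_count(s, transactions) >= min_count,
                      candidates)
  }
  frequent
}

frequent <- apriori_sketch(transactions, min_count = 3)
print(vapply(frequent, paste, character(1), collapse = ", "))
```

With a threshold of 3 transactions the sketch keeps three single items (ground beef, mineral water, spaghetti) and two pairs (ground beef with spaghetti, mineral water with spaghetti); eggs are dropped at the first level, so no candidate containing eggs is ever generated.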
The key measures used to assess the rules found by the algorithm are support, confidence and lift.
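Support, confidence and lift are the three measures reported in all the rule tables in this paper. As a minimal sketch, they can be computed by hand on a toy dataset: support is the share of transactions containing all items of a rule, confidence is that support divided by the support of the left-hand side, and lift is confidence divided by the support of the right-hand side. The transactions and the helper names `support()` and `rule_measures()` are hypothetical and serve only as an illustration.

```r
# Toy transactions (hypothetical data)
transactions <- list(
  c("ground beef", "spaghetti", "mineral water"),
  c("ground beef", "spaghetti"),
  c("spaghetti", "mineral water"),
  c("mineral water"),
  c("eggs", "mineral water")
)

# Share of transactions that contain every item of `itemset`
support <- function(itemset, transactions) {
  mean(vapply(transactions, function(t) all(itemset %in% t), logical(1)))
}

# Support, confidence and lift of the rule lhs => rhs
rule_measures <- function(lhs, rhs, transactions) {
  supp <- support(c(lhs, rhs), transactions)
  conf <- supp / support(lhs, transactions)
  lift <- conf / support(rhs, transactions)
  c(support = supp, confidence = conf, lift = lift)
}

m <- rule_measures("ground beef", "spaghetti", transactions)
print(round(m, 3))
```

A lift above 1 indicates that the two sides of the rule occur together more often than they would under independence; a lift below 1 indicates the opposite.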
First, I’m loading the dataset and checking item frequency. It appears that several items are bought more often than the rest: mineral water, eggs, spaghetti, french fries, chocolate, green tea and so on.
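library(arules)      # read.transactions(), apriori(), inspect()
library(arulesViz)   # plot() methods for association rules
library(networkD3)   # sankeyNetwork(), forceNetwork(), JS()
basket <- read.transactions("basket.csv", format="basket", sep=",", skip=0)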
length(basket)
## [1] 7501
itemFrequency(basket, type="relative")
## almonds antioxydant juice asparagus
## 0.0203972804 0.0089321424 0.0047993601
## avocado babies food bacon
## 0.0333288895 0.0045327290 0.0086655113
## barbecue sauce black tea blueberries
## 0.0107985602 0.0142647647 0.0091987735
## body spray bramble brownies
## 0.0114651380 0.0018664178 0.0337288362
## bug spray burger sauce burgers
## 0.0086655113 0.0058658845 0.0871883749
## butter cake candy bars
## 0.0301293161 0.0810558592 0.0097320357
## carrots cauliflower cereals
## 0.0153312892 0.0047993601 0.0257299027
## champagne chicken chili
## 0.0467937608 0.0599920011 0.0061325157
## chocolate chocolate bread chutney
## 0.1638448207 0.0042660979 0.0041327823
## cider clothes accessories cookies
## 0.0105319291 0.0083988801 0.0803892814
## cooking oil corn cottage cheese
## 0.0510598587 0.0047993601 0.0318624183
## cream dessert wine eggplant
## 0.0009332089 0.0043994134 0.0131982402
## eggs energy bar energy drink
## 0.1797093721 0.0270630583 0.0266631116
## escalope extra dark chocolate flax seed
## 0.0793227570 0.0119984002 0.0090654579
## french fries french wine fresh bread
## 0.1709105453 0.0225303293 0.0430609252
## fresh tuna fromage blanc frozen smoothie
## 0.0222636982 0.0135981869 0.0633248900
## frozen vegetables gluten free bar grated cheese
## 0.0953206239 0.0069324090 0.0523930143
## green beans green grapes green tea
## 0.0086655113 0.0090654579 0.1321157179
## ground beef gums ham
## 0.0982535662 0.0134648714 0.0265297960
## hand protein bar herb & pepper honey
## 0.0051993068 0.0494600720 0.0474603386
## hot dogs ketchup light cream
## 0.0323956806 0.0043994134 0.0155979203
## light mayo low fat yogurt magazines
## 0.0271963738 0.0765231302 0.0109318757
## mashed potato mayonnaise meatballs
## 0.0041327823 0.0061325157 0.0209305426
## melons milk mineral water
## 0.0119984002 0.1295827223 0.2383682176
## mint mint green tea muffins
## 0.0174643381 0.0055992534 0.0241301160
## mushroom cream sauce napkins nonfat milk
## 0.0190641248 0.0006665778 0.0103986135
## oatmeal oil olive oil
## 0.0043994134 0.0230635915 0.0658578856
## pancakes parmesan cheese pasta
## 0.0950539928 0.0198640181 0.0157312358
## pepper pet food pickles
## 0.0265297960 0.0065324623 0.0059992001
## protein bar red wine rice
## 0.0185308626 0.0281295827 0.0187974937
## salad salmon salt
## 0.0049326756 0.0425276630 0.0091987735
## sandwich shallot shampoo
## 0.0045327290 0.0077323024 0.0049326756
## shrimp soda soup
## 0.0714571390 0.0062658312 0.0505265965
## spaghetti sparkling water spinach
## 0.1741101187 0.0062658312 0.0070657246
## strawberries strong cheese tea
## 0.0213304893 0.0077323024 0.0038661512
## tomato juice tomato sauce tomatoes
## 0.0303959472 0.0141314491 0.0683908812
## toothpaste turkey vegetables mix
## 0.0081322490 0.0625249967 0.0257299027
## water spray white wine whole weat flour
## 0.0003999467 0.0165311292 0.0093320891
## whole wheat pasta whole wheat rice yams
## 0.0294627383 0.0585255299 0.0114651380
## yogurt cake zucchini
## 0.0273296894 0.0094654046
itemFrequencyPlot(basket, topN=30, type="relative", main="Item Frequency")
Then, I’m running the apriori algorithm. With these thresholds the algorithm isolates only 10 rules, and 8 of them have mineral water as the consequent.
rules.basket <- apriori(basket, parameter = list(supp = 0.025, conf = 0.3, minlen=2, maxlen=2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.3 0.1 1 none FALSE TRUE 5 0.025 2
## maxlen target ext
## 2 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 187
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.01s].
## sorting and recoding items ... [46 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [10 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(rules.basket)
## lhs rhs support confidence lift
## [1] {olive oil} => {mineral water} 0.02759632 0.4190283 1.757904
## [2] {cake} => {mineral water} 0.02746300 0.3388158 1.421397
## [3] {burgers} => {eggs} 0.02879616 0.3302752 1.837830
## [4] {pancakes} => {mineral water} 0.03372884 0.3548387 1.488616
## [5] {frozen vegetables} => {mineral water} 0.03572857 0.3748252 1.572463
## [6] {ground beef} => {spaghetti} 0.03919477 0.3989145 2.291162
## [7] {ground beef} => {mineral water} 0.04092788 0.4165536 1.747522
## [8] {milk} => {mineral water} 0.04799360 0.3703704 1.553774
## [9] {chocolate} => {mineral water} 0.05265965 0.3213995 1.348332
## [10] {spaghetti} => {mineral water} 0.05972537 0.3430322 1.439085
## count
## [1] 207
## [2] 206
## [3] 216
## [4] 253
## [5] 268
## [6] 294
## [7] 307
## [8] 360
## [9] 395
## [10] 448
Below I present traditional arulesViz plots.
plot(rules.basket, method="matrix", measure="lift")
## Itemsets in Antecedent (LHS)
## [1] "{ground beef}" "{burgers}" "{olive oil}"
## [4] "{frozen vegetables}" "{milk}" "{pancakes}"
## [7] "{spaghetti}" "{cake}" "{chocolate}"
## Itemsets in Consequent (RHS)
## [1] "{mineral water}" "{eggs}" "{spaghetti}"
plot(rules.basket, measure=c("support","lift"), shading="confidence")
plot(rules.basket, method="grouped")
plot(rules.basket, method="graph")
plot(rules.basket, method="graph", control=list(type="items"))
## Available control parameters (with default values):
## main = Graph for 10 rules
## nodeColors = c("#66CC6680", "#9999CC80")
## nodeCol = c("#EE0000FF", "#EE0303FF", "#EE0606FF", "#EE0909FF", "#EE0C0CFF", "#EE0F0FFF", "#EE1212FF", "#EE1515FF", "#EE1818FF", "#EE1B1BFF", "#EE1E1EFF", "#EE2222FF", "#EE2525FF", "#EE2828FF", "#EE2B2BFF", "#EE2E2EFF", "#EE3131FF", "#EE3434FF", "#EE3737FF", "#EE3A3AFF", "#EE3D3DFF", "#EE4040FF", "#EE4444FF", "#EE4747FF", "#EE4A4AFF", "#EE4D4DFF", "#EE5050FF", "#EE5353FF", "#EE5656FF", "#EE5959FF", "#EE5C5CFF", "#EE5F5FFF", "#EE6262FF", "#EE6666FF", "#EE6969FF", "#EE6C6CFF", "#EE6F6FFF", "#EE7272FF", "#EE7575FF", "#EE7878FF", "#EE7B7BFF", "#EE7E7EFF", "#EE8181FF", "#EE8484FF", "#EE8888FF", "#EE8B8BFF", "#EE8E8EFF", "#EE9191FF", "#EE9494FF", "#EE9797FF", "#EE9999FF", "#EE9B9BFF", "#EE9D9DFF", "#EE9F9FFF", "#EEA0A0FF", "#EEA2A2FF", "#EEA4A4FF", "#EEA5A5FF", "#EEA7A7FF", "#EEA9A9FF", "#EEABABFF", "#EEACACFF", "#EEAEAEFF", "#EEB0B0FF", "#EEB1B1FF", "#EEB3B3FF", "#EEB5B5FF", "#EEB7B7FF", "#EEB8B8FF", "#EEBABAFF", "#EEBCBCFF", "#EEBDBDFF", "#EEBFBFFF", "#EEC1C1FF", "#EEC3C3FF", "#EEC4C4FF", "#EEC6C6FF", "#EEC8C8FF", "#EEC9C9FF", "#EECBCBFF", "#EECDCDFF", "#EECFCFFF", "#EED0D0FF", "#EED2D2FF", "#EED4D4FF", "#EED5D5FF", "#EED7D7FF", "#EED9D9FF", "#EEDBDBFF", "#EEDCDCFF", "#EEDEDEFF", "#EEE0E0FF", "#EEE1E1FF", "#EEE3E3FF", "#EEE5E5FF", "#EEE7E7FF", "#EEE8E8FF", "#EEEAEAFF", "#EEECECFF", "#EEEEEEFF")
## edgeCol = c("#474747FF", "#494949FF", "#4B4B4BFF", "#4D4D4DFF", "#4F4F4FFF", "#515151FF", "#535353FF", "#555555FF", "#575757FF", "#595959FF", "#5B5B5BFF", "#5E5E5EFF", "#606060FF", "#626262FF", "#646464FF", "#666666FF", "#686868FF", "#6A6A6AFF", "#6C6C6CFF", "#6E6E6EFF", "#707070FF", "#727272FF", "#747474FF", "#767676FF", "#787878FF", "#7A7A7AFF", "#7C7C7CFF", "#7E7E7EFF", "#808080FF", "#828282FF", "#848484FF", "#868686FF", "#888888FF", "#8A8A8AFF", "#8C8C8CFF", "#8D8D8DFF", "#8F8F8FFF", "#919191FF", "#939393FF", "#959595FF", "#979797FF", "#999999FF", "#9A9A9AFF", "#9C9C9CFF", "#9E9E9EFF", "#A0A0A0FF", "#A2A2A2FF", "#A3A3A3FF", "#A5A5A5FF", "#A7A7A7FF", "#A9A9A9FF", "#AAAAAAFF", "#ACACACFF", "#AEAEAEFF", "#AFAFAFFF", "#B1B1B1FF", "#B3B3B3FF", "#B4B4B4FF", "#B6B6B6FF", "#B7B7B7FF", "#B9B9B9FF", "#BBBBBBFF", "#BCBCBCFF", "#BEBEBEFF", "#BFBFBFFF", "#C1C1C1FF", "#C2C2C2FF", "#C3C3C4FF", "#C5C5C5FF", "#C6C6C6FF", "#C8C8C8FF", "#C9C9C9FF", "#CACACAFF", "#CCCCCCFF", "#CDCDCDFF", "#CECECEFF", "#CFCFCFFF", "#D1D1D1FF", "#D2D2D2FF", "#D3D3D3FF", "#D4D4D4FF", "#D5D5D5FF", "#D6D6D6FF", "#D7D7D7FF", "#D8D8D8FF", "#D9D9D9FF", "#DADADAFF", "#DBDBDBFF", "#DCDCDCFF", "#DDDDDDFF", "#DEDEDEFF", "#DEDEDEFF", "#DFDFDFFF", "#E0E0E0FF", "#E0E0E0FF", "#E1E1E1FF", "#E1E1E1FF", "#E2E2E2FF", "#E2E2E2FF", "#E2E2E2FF")
## alpha = 0.5
## cex = 1
## itemLabels = TRUE
## labelCol = #000000B3
## measureLabels = FALSE
## precision = 3
## layout = NULL
## layoutParams = list()
## arrowSize = 0.5
## engine = igraph
## plot = TRUE
## plot_options = list()
## max = 100
## verbose = FALSE
plot(rules.basket, method="paracoord", control=list(reorder=TRUE))
It can be seen on multiple charts that the highest support belongs to the rules linking spaghetti, chocolate and milk with mineral water, whilst the highest lift is between ground beef and spaghetti, followed by burgers and eggs. The lift of 2.3 between ground beef and spaghetti means that this bundle is bought at more than double the rate we would expect if those items were independent. The support of 6% for spaghetti and mineral water means that 6% of all transactions include both products.
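The figures for the rule {ground beef} => {spaghetti} can be retraced by hand from numbers already printed above: the rule count (294), the number of transactions (7501) and the relative item frequencies of ground beef and spaghetti. The variable names below are chosen only for this illustration.

```r
n_transactions   <- 7501          # length(basket)
count_rule       <- 294           # transactions with both ground beef and spaghetti
freq_ground_beef <- 0.0982535662  # from the itemFrequency() output
freq_spaghetti   <- 0.1741101187  # from the itemFrequency() output

support    <- count_rule / n_transactions   # share of all transactions
confidence <- support / freq_ground_beef    # P(spaghetti | ground beef)
lift       <- confidence / freq_spaghetti   # ratio to the independence baseline

print(round(c(support = support, confidence = confidence, lift = lift), 4))
```

The results agree with the support, confidence and lift columns of the apriori output for this rule (about 0.039, 0.399 and 2.29).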
Now I want to present two suitable D3 diagrams for this set of rules. This, however, requires some data preparation. The code for that is explained in comments.
# extract association data
# first column is the start point (source)
# second column is the end point (target)
# third column is the value that should influence the width of the links between source and target (in this case we use lift value)
links <- inspect(rules.basket)[,c(1,3,6)]
## lhs rhs support confidence lift
## [1] {olive oil} => {mineral water} 0.02759632 0.4190283 1.757904
## [2] {cake} => {mineral water} 0.02746300 0.3388158 1.421397
## [3] {burgers} => {eggs} 0.02879616 0.3302752 1.837830
## [4] {pancakes} => {mineral water} 0.03372884 0.3548387 1.488616
## [5] {frozen vegetables} => {mineral water} 0.03572857 0.3748252 1.572463
## [6] {ground beef} => {spaghetti} 0.03919477 0.3989145 2.291162
## [7] {ground beef} => {mineral water} 0.04092788 0.4165536 1.747522
## [8] {milk} => {mineral water} 0.04799360 0.3703704 1.553774
## [9] {chocolate} => {mineral water} 0.05265965 0.3213995 1.348332
## [10] {spaghetti} => {mineral water} 0.05972537 0.3430322 1.439085
## count
## [1] 207
## [2] 206
## [3] 216
## [4] 253
## [5] 268
## [6] 294
## [7] 307
## [8] 360
## [9] 395
## [10] 448
# create vector with unique values of the source and target columns
unique_names <- unique(c(unique(levels(links$rhs)), unique(levels(links$lhs))))
# create data frame containing all unique names and ID (starting from 0) for each of them
nodes <- data.frame(name = unique_names, id=(0:(length(unique_names)-1)))
# now we need to exchange the names of source (lhs) and target (rhs) to their IDs in the links data frame
links <- merge(nodes, links, by.x='name', by.y='lhs')
links$source <- links$id
links <- links[,3:5]
links <- merge(nodes, links, by.x='name', by.y='rhs')
links$target <- links$id
links$value <- links$lift
links <- links[,4:6]
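The preparation above depends on the column positions of the `inspect()` output and on the column order after each `merge()`. As a hedged alternative, the same node and link tables can be built with `match()`. The helper `build_network()` is hypothetical; with a rules object its inputs could presumably be taken from `labels(lhs(rules.basket))`, `labels(rhs(rules.basket))` and `quality(rules.basket)$lift`, but here three of the rules reported above are typed in by hand so that the sketch runs without any package.

```r
# Hypothetical helper: turn rule labels and a weight into networkD3-style tables
build_network <- function(lhs, rhs, value) {
  node_names <- unique(c(rhs, lhs))
  nodes <- data.frame(name = node_names,
                      id = seq_along(node_names) - 1,  # zero-based IDs, as D3 expects
                      stringsAsFactors = FALSE)
  links <- data.frame(source = match(lhs, node_names) - 1,
                      target = match(rhs, node_names) - 1,
                      value = value)
  list(nodes = nodes, links = links)
}

# Three rules copied from the apriori output above
net <- build_network(
  lhs   = c("{ground beef}", "{ground beef}", "{burgers}"),
  rhs   = c("{spaghetti}", "{mineral water}", "{eggs}"),
  value = c(2.291162, 1.747522, 1.837830)
)
print(net$nodes)
print(net$links)
```

The row order of `nodes` matters: networkD3 resolves the zero-based `source` and `target` values by row position, which is why the IDs are derived from the same vector that defines the rows.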
A Sankey diagram is a type of flow diagram in which the width of the arrows represents the size of the flow (in my case, lift). One of the best-known examples of this kind of diagram is Charles Minard’s chart of Napoleon’s invasion of Russia. (https://en.wikipedia.org/wiki/Sankey_diagram#/media/File:Minard.png)
sankeyNetwork(Links = links, Nodes = nodes, Source = "source",
Target = "target", Value = "value", NodeID = "name", colourScale = JS("d3.scaleOrdinal(d3.schemeCategory20);"))
It can be seen that in this basket, it’s all about water. The widest arrow is between ground beef and spaghetti (meatballs?), which corresponds to the relatively high lift seen in the apriori output table above. All of the other relations have comparable lift, roughly between 1.3 and 1.8. The second relationship not related to water is between burgers and eggs, which makes sense: burgers are often prepared using an egg. There is a second applicable chart from the D3 library: the force network chart presented below.
forceNetwork(Links = links, Nodes = nodes,
Source = "source", Target = "target",
Value = "value", NodeID = "id",
Group = "name", width = 550, height = 400,
opacity = 0.9, zoom = TRUE)
Sadly, as configured here, this kind of diagram displays only a numeric ID (when clicking a dot). However, the corresponding rule can easily be checked in the apriori output table presented above. The bright blue dot in the very center of the concentric diagram is, of course, water. The dots around it are olive oil, cake, pancakes, frozen vegetables, ground beef, spaghetti, milk and chocolate: items frequently bought with water. We can see that ground beef and spaghetti are connected too (the pink and orange dots). Then there is a separate relation not connected to water: burgers and eggs. Note that these D3 charts are highly interactive in comparison to the charts from the arulesViz package.
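The numeric labels most likely come from passing `NodeID = "id"` in the call above, since `forceNetwork()` uses the `NodeID` column as the label of each dot. A possible variation, sketched here on a small hand-made subset of the rules, passes the name column instead. The data frames below are illustrative, and the chart is only built when “networkD3” is installed.

```r
# Hand-made subset of the network (illustrative values taken from the rule table)
nodes <- data.frame(name = c("{mineral water}", "{spaghetti}", "{ground beef}"),
                    id = 0:2, stringsAsFactors = FALSE)
links <- data.frame(source = c(2, 2, 1),   # zero-based row positions in `nodes`
                    target = c(1, 0, 0),
                    value  = c(2.291162, 1.747522, 1.439085))

if (requireNamespace("networkD3", quietly = TRUE)) {
  # NodeID = "name" labels each dot with the itemset instead of its numeric ID
  chart <- networkD3::forceNetwork(Links = links, Nodes = nodes,
                                   Source = "source", Target = "target",
                                   Value = "value", NodeID = "name",
                                   Group = "name", opacity = 0.9, zoom = TRUE)
  print(chart)
}
```

Keeping `Group = "name"` still gives every node its own colour, as in the original call.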
Now, I want to inspect the occurrence of my very favourite vegan, millennial product: avocado. I’m doing that in a manner similar to the general case.
rules.avocado<-apriori(data=basket, parameter=list(supp=0.001,conf = 0.08),
appearance=list(default="lhs", rhs="avocado"), control=list(verbose=F))
rules.avocado.byconf<-sort(rules.avocado, by="confidence", decreasing=TRUE)
inspect(rules.avocado.byconf)
## lhs rhs support
## [1] {french fries,oil} => {avocado} 0.001066524
## [2] {green tea,soup} => {avocado} 0.001599787
## [3] {milk,spaghetti,tomatoes} => {avocado} 0.001066524
## [4] {cake,turkey} => {avocado} 0.001199840
## [5] {frozen smoothie,olive oil} => {avocado} 0.001199840
## [6] {burgers,milk,spaghetti} => {avocado} 0.001066524
## [7] {pancakes,tomatoes} => {avocado} 0.001466471
## [8] {milk,olive oil,spaghetti} => {avocado} 0.001066524
## [9] {french fries,soup} => {avocado} 0.001066524
## [10] {cottage cheese,mineral water} => {avocado} 0.001333156
## [11] {fresh tuna,mineral water} => {avocado} 0.001199840
## [12] {chocolate,milk,spaghetti} => {avocado} 0.001466471
## [13] {frozen smoothie,frozen vegetables} => {avocado} 0.001333156
## [14] {toothpaste} => {avocado} 0.001066524
## [15] {frozen vegetables,milk,spaghetti} => {avocado} 0.001066524
## [16] {milk,mineral water,olive oil} => {avocado} 0.001066524
## [17] {frozen vegetables,turkey} => {avocado} 0.001066524
## [18] {frozen smoothie,green tea} => {avocado} 0.001333156
## [19] {burgers,spaghetti} => {avocado} 0.002532996
## [20] {frozen vegetables,whole wheat rice} => {avocado} 0.001066524
## [21] {cake,frozen vegetables} => {avocado} 0.001199840
## [22] {french fries,olive oil} => {avocado} 0.001066524
## [23] {melons} => {avocado} 0.001333156
## [24] {milk,mineral water,spaghetti} => {avocado} 0.001733102
## [25] {french fries,frozen smoothie} => {avocado} 0.001599787
## [26] {chocolate,frozen smoothie} => {avocado} 0.001599787
## [27] {milk,soup} => {avocado} 0.001599787
## [28] {milk,tomatoes} => {avocado} 0.001466471
## [29] {yams} => {avocado} 0.001199840
## [30] {burgers,cake} => {avocado} 0.001199840
## [31] {frozen smoothie,milk} => {avocado} 0.001466471
## [32] {frozen smoothie,spaghetti} => {avocado} 0.001599787
## [33] {french fries,whole wheat rice} => {avocado} 0.001066524
## [34] {milk,whole wheat rice} => {avocado} 0.001199840
## [35] {burgers,turkey} => {avocado} 0.001066524
## [36] {frozen smoothie,mineral water} => {avocado} 0.001999733
## [37] {mineral water,soup} => {avocado} 0.002266364
## [38] {milk,spaghetti} => {avocado} 0.003332889
## [39] {milk,olive oil} => {avocado} 0.001599787
## [40] {soup,spaghetti} => {avocado} 0.001333156
## [41] {body spray} => {avocado} 0.001066524
## [42] {honey,spaghetti} => {avocado} 0.001066524
## [43] {fresh tuna} => {avocado} 0.001999733
## [44] {cake,french fries} => {avocado} 0.001599787
## [45] {cake,chocolate} => {avocado} 0.001199840
## [46] {soup} => {avocado} 0.004399413
## [47] {eggs,tomatoes} => {avocado} 0.001066524
## [48] {french fries,mineral water} => {avocado} 0.002932942
## [49] {french fries,pancakes} => {avocado} 0.001733102
## [50] {almonds} => {avocado} 0.001733102
## [51] {spaghetti,whole wheat rice} => {avocado} 0.001199840
## [52] {black tea} => {avocado} 0.001199840
## [53] {french fries,green tea} => {avocado} 0.002399680
## [54] {spaghetti,tomatoes} => {avocado} 0.001733102
## [55] {burgers,mineral water} => {avocado} 0.001999733
## [56] {green tea,pancakes} => {avocado} 0.001333156
## [57] {oil} => {avocado} 0.001866418
## [58] {spaghetti,turkey} => {avocado} 0.001333156
## [59] {milk,pancakes} => {avocado} 0.001333156
## [60] {frozen smoothie} => {avocado} 0.005065991
## [61] {cookies,french fries} => {avocado} 0.001066524
## confidence lift count
## [1] 0.26666667 8.001067 8
## [2] 0.22641509 6.793358 12
## [3] 0.18181818 5.455273 8
## [4] 0.17307692 5.193000 9
## [5] 0.17307692 5.193000 9
## [6] 0.16666667 5.000667 8
## [7] 0.14864865 4.460054 11
## [8] 0.14814815 4.445037 8
## [9] 0.14035088 4.211088 8
## [10] 0.13888889 4.167222 10
## [11] 0.13636364 4.091455 9
## [12] 0.13414634 4.024927 11
## [13] 0.13333333 4.000533 10
## [14] 0.13114754 3.934951 8
## [15] 0.12903226 3.871484 8
## [16] 0.12500000 3.750500 8
## [17] 0.12121212 3.636848 8
## [18] 0.11904762 3.571905 10
## [19] 0.11801242 3.540845 19
## [20] 0.11764706 3.529882 8
## [21] 0.11688312 3.506961 9
## [22] 0.11594203 3.478725 8
## [23] 0.11111111 3.333778 10
## [24] 0.11016949 3.305525 13
## [25] 0.11009174 3.303193 12
## [26] 0.10714286 3.214714 12
## [27] 0.10526316 3.158316 12
## [28] 0.10476190 3.143276 11
## [29] 0.10465116 3.139953 9
## [30] 0.10465116 3.139953 9
## [31] 0.10280374 3.084523 11
## [32] 0.10256410 3.077333 12
## [33] 0.10126582 3.038380 8
## [34] 0.10112360 3.034112 9
## [35] 0.10000000 3.000400 8
## [36] 0.09868421 2.960921 15
## [37] 0.09826590 2.948370 17
## [38] 0.09398496 2.819925 25
## [39] 0.09375000 2.812875 12
## [40] 0.09345794 2.804112 10
## [41] 0.09302326 2.791070 8
## [42] 0.08988764 2.696989 8
## [43] 0.08982036 2.694970 15
## [44] 0.08955224 2.686925 12
## [45] 0.08823529 2.647412 9
## [46] 0.08707124 2.612485 33
## [47] 0.08695652 2.609043 8
## [48] 0.08695652 2.609043 22
## [49] 0.08609272 2.583126 13
## [50] 0.08496732 2.549359 13
## [51] 0.08490566 2.547509 9
## [52] 0.08411215 2.523701 9
## [53] 0.08411215 2.523701 18
## [54] 0.08280255 2.484408 13
## [55] 0.08196721 2.459344 15
## [56] 0.08130081 2.439350 10
## [57] 0.08092486 2.428069 14
## [58] 0.08064516 2.419677 10
## [59] 0.08064516 2.419677 10
## [60] 0.08000000 2.400320 38
## [61] 0.08000000 2.400320 8
The apriori algorithm isolated 61 rules with avocado as the consequent, i.e. combinations of products with which avocado is bought. The highest lift belongs to the rules {french fries, oil} => {avocado} (a lift of 8!), {green tea, soup} => {avocado} and {milk, spaghetti, tomatoes} => {avocado}. This is not a vegan dataset but, surprisingly, most of the products bought most often with avocado are vegan. The support of all rules is quite low: mostly around 0.001, with the highest value being 0.005 for {frozen smoothie} => {avocado}, followed by 0.004 for {soup} => {avocado}. Avocado is not a frequent item in these transactions.
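Because every rule in this set has the same consequent, lift is simply confidence divided by the relative frequency of avocado (0.0333288895 in the item frequency output), so sorting by confidence and sorting by lift give the same order. A quick check on the three strongest rules, with the values copied from the table above:

```r
freq_avocado <- 0.0333288895   # relative item frequency of avocado

# Confidence of the three strongest rules, copied from the inspect() output
confidence <- c("{french fries,oil}"        = 0.26666667,
                "{green tea,soup}"          = 0.22641509,
                "{milk,spaghetti,tomatoes}" = 0.18181818)

# Same right-hand side for all rules, so one constant denominator
lift <- confidence / freq_avocado
print(round(lift, 3))
```

The values reproduce the lift column of the table (about 8.00, 6.79 and 5.46).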
Now, I’m plotting in a standard way. There are a lot of rules this time so I exclude charts which are unreadable.
plot(rules.avocado, method="matrix", measure="lift")
plot(rules.avocado, measure=c("support","lift"), shading="confidence")
plot(rules.avocado, method="paracoord", control=list(reorder=TRUE))
Now, I’m once again preparing data for D3 charts and running them.
links<- inspect(rules.avocado)[,c(1,3,6)]
## lhs rhs support
## [1] {toothpaste} => {avocado} 0.001066524
## [2] {body spray} => {avocado} 0.001066524
## [3] {melons} => {avocado} 0.001333156
## [4] {yams} => {avocado} 0.001199840
## [5] {black tea} => {avocado} 0.001199840
## [6] {almonds} => {avocado} 0.001733102
## [7] {oil} => {avocado} 0.001866418
## [8] {fresh tuna} => {avocado} 0.001999733
## [9] {soup} => {avocado} 0.004399413
## [10] {frozen smoothie} => {avocado} 0.005065991
## [11] {french fries,oil} => {avocado} 0.001066524
## [12] {fresh tuna,mineral water} => {avocado} 0.001199840
## [13] {cottage cheese,mineral water} => {avocado} 0.001333156
## [14] {cookies,french fries} => {avocado} 0.001066524
## [15] {honey,spaghetti} => {avocado} 0.001066524
## [16] {green tea,soup} => {avocado} 0.001599787
## [17] {milk,soup} => {avocado} 0.001599787
## [18] {french fries,soup} => {avocado} 0.001066524
## [19] {soup,spaghetti} => {avocado} 0.001333156
## [20] {mineral water,soup} => {avocado} 0.002266364
## [21] {frozen vegetables,whole wheat rice} => {avocado} 0.001066524
## [22] {milk,whole wheat rice} => {avocado} 0.001199840
## [23] {french fries,whole wheat rice} => {avocado} 0.001066524
## [24] {spaghetti,whole wheat rice} => {avocado} 0.001199840
## [25] {cake,turkey} => {avocado} 0.001199840
## [26] {burgers,turkey} => {avocado} 0.001066524
## [27] {frozen vegetables,turkey} => {avocado} 0.001066524
## [28] {spaghetti,turkey} => {avocado} 0.001333156
## [29] {frozen smoothie,olive oil} => {avocado} 0.001199840
## [30] {frozen smoothie,frozen vegetables} => {avocado} 0.001333156
## [31] {frozen smoothie,green tea} => {avocado} 0.001333156
## [32] {frozen smoothie,milk} => {avocado} 0.001466471
## [33] {french fries,frozen smoothie} => {avocado} 0.001599787
## [34] {chocolate,frozen smoothie} => {avocado} 0.001599787
## [35] {frozen smoothie,spaghetti} => {avocado} 0.001599787
## [36] {frozen smoothie,mineral water} => {avocado} 0.001999733
## [37] {pancakes,tomatoes} => {avocado} 0.001466471
## [38] {milk,tomatoes} => {avocado} 0.001466471
## [39] {eggs,tomatoes} => {avocado} 0.001066524
## [40] {spaghetti,tomatoes} => {avocado} 0.001733102
## [41] {milk,olive oil} => {avocado} 0.001599787
## [42] {french fries,olive oil} => {avocado} 0.001066524
## [43] {burgers,cake} => {avocado} 0.001199840
## [44] {cake,frozen vegetables} => {avocado} 0.001199840
## [45] {cake,french fries} => {avocado} 0.001599787
## [46] {cake,chocolate} => {avocado} 0.001199840
## [47] {burgers,spaghetti} => {avocado} 0.002532996
## [48] {burgers,mineral water} => {avocado} 0.001999733
## [49] {green tea,pancakes} => {avocado} 0.001333156
## [50] {milk,pancakes} => {avocado} 0.001333156
## [51] {french fries,pancakes} => {avocado} 0.001733102
## [52] {french fries,green tea} => {avocado} 0.002399680
## [53] {milk,spaghetti} => {avocado} 0.003332889
## [54] {french fries,mineral water} => {avocado} 0.002932942
## [55] {milk,spaghetti,tomatoes} => {avocado} 0.001066524
## [56] {milk,olive oil,spaghetti} => {avocado} 0.001066524
## [57] {milk,mineral water,olive oil} => {avocado} 0.001066524
## [58] {burgers,milk,spaghetti} => {avocado} 0.001066524
## [59] {frozen vegetables,milk,spaghetti} => {avocado} 0.001066524
## [60] {chocolate,milk,spaghetti} => {avocado} 0.001466471
## [61] {milk,mineral water,spaghetti} => {avocado} 0.001733102
## confidence lift count
## [1] 0.13114754 3.934951 8
## [2] 0.09302326 2.791070 8
## [3] 0.11111111 3.333778 10
## [4] 0.10465116 3.139953 9
## [5] 0.08411215 2.523701 9
## [6] 0.08496732 2.549359 13
## [7] 0.08092486 2.428069 14
## [8] 0.08982036 2.694970 15
## [9] 0.08707124 2.612485 33
## [10] 0.08000000 2.400320 38
## [11] 0.26666667 8.001067 8
## [12] 0.13636364 4.091455 9
## [13] 0.13888889 4.167222 10
## [14] 0.08000000 2.400320 8
## [15] 0.08988764 2.696989 8
## [16] 0.22641509 6.793358 12
## [17] 0.10526316 3.158316 12
## [18] 0.14035088 4.211088 8
## [19] 0.09345794 2.804112 10
## [20] 0.09826590 2.948370 17
## [21] 0.11764706 3.529882 8
## [22] 0.10112360 3.034112 9
## [23] 0.10126582 3.038380 8
## [24] 0.08490566 2.547509 9
## [25] 0.17307692 5.193000 9
## [26] 0.10000000 3.000400 8
## [27] 0.12121212 3.636848 8
## [28] 0.08064516 2.419677 10
## [29] 0.17307692 5.193000 9
## [30] 0.13333333 4.000533 10
## [31] 0.11904762 3.571905 10
## [32] 0.10280374 3.084523 11
## [33] 0.11009174 3.303193 12
## [34] 0.10714286 3.214714 12
## [35] 0.10256410 3.077333 12
## [36] 0.09868421 2.960921 15
## [37] 0.14864865 4.460054 11
## [38] 0.10476190 3.143276 11
## [39] 0.08695652 2.609043 8
## [40] 0.08280255 2.484408 13
## [41] 0.09375000 2.812875 12
## [42] 0.11594203 3.478725 8
## [43] 0.10465116 3.139953 9
## [44] 0.11688312 3.506961 9
## [45] 0.08955224 2.686925 12
## [46] 0.08823529 2.647412 9
## [47] 0.11801242 3.540845 19
## [48] 0.08196721 2.459344 15
## [49] 0.08130081 2.439350 10
## [50] 0.08064516 2.419677 10
## [51] 0.08609272 2.583126 13
## [52] 0.08411215 2.523701 18
## [53] 0.09398496 2.819925 25
## [54] 0.08695652 2.609043 22
## [55] 0.18181818 5.455273 8
## [56] 0.14814815 4.445037 8
## [57] 0.12500000 3.750500 8
## [58] 0.16666667 5.000667 8
## [59] 0.12903226 3.871484 8
## [60] 0.13414634 4.024927 11
## [61] 0.11016949 3.305525 13
unique_names <- unique(c(unique(levels(links$rhs)), unique(levels(links$lhs))))
nodes <- data.frame(name = unique_names, id=(0:(length(unique_names)-1)))
links <- merge(nodes, links, by.x='name', by.y='lhs')
links$source <- links$id
links <- links[,3:5]
links <- merge(nodes, links, by.x='name', by.y='rhs')
links$target <- links$id
links$value <- links$lift
links <- links[,4:6]
forceNetwork(Links = links, Nodes = nodes,
Source = "source", Target = "target",
Value = "value", NodeID = "id",
Group = "name", width = 550, height = 400,
opacity = 0.9, zoom = TRUE)
sankeyNetwork(Links = links, Nodes = nodes, Source = "source",
Target = "target", Value = "value", NodeID = "name", colourScale = JS("d3.scaleOrdinal(d3.schemeCategory20);"))
In the center of the network diagram we have, of course, avocado. The other dots correspond to the product combinations bought with avocado. A major drawback of this diagram is the lack of information about any of the key measures: as the name suggests, it just shows the network of these products. The Sankey diagram, in turn, was designed here so that the width of each arrow reflects the lift. However, with 61 rules this diagram is not easily readable on a small screen. When the chart is enlarged to full screen, the widths of the arrows become distinguishable. We can then see that the widest ones indeed correspond to the rules with the highest lift mentioned above. For example, french fries, oil and avocado are bought together 8 times more often than they would be if the pair {french fries, oil} and avocado were independent.
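One possible way to ease the readability problem is to keep only the strongest rules before drawing the diagram. The sketch below does this on a plain data frame holding five of the avocado rules copied from the table above; `top_n_rules()` is a hypothetical helper. On the rules object itself, something like `head(sort(rules.avocado, by = "lift"), 10)` should achieve the same, assuming the `sort()` and `head()` methods of “arules”.

```r
# Five avocado rules copied from the inspect() output (lhs and lift only)
rules_df <- data.frame(
  lhs  = c("{french fries,oil}", "{green tea,soup}", "{soup}",
           "{frozen smoothie}", "{cake,turkey}"),
  lift = c(8.001067, 6.793358, 2.612485, 2.400320, 5.193000),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the n rules with the highest lift
top_n_rules <- function(df, n) {
  head(df[order(df$lift, decreasing = TRUE), ], n)
}

top3 <- top_n_rules(rules_df, 3)
print(top3)
```

Fewer links leave more vertical space for each arrow, so differences in width (lift) are easier to see without zooming.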
Affinity analysis is a highly interesting and practical topic, as it can be applied to investigating human behaviour. This kind of analysis is relatively new (it dates back to the nineties). Moreover, its implementation in R is said to be pretty poor. One alternative way to visualise the resulting rules is the D3 JavaScript library. However, only two chart types from this library are applicable, and a key measure can be shown on only one of them.