Association Rules and Market Basket Analysis are often used by shops - this part of unsupervised learning helps sellers know patterns and bindings of clients behaviors. Thanks to them sellers are able to predict which products are linked - if somebody buys X then he or she tends to buy Y. I want to use it to analyze patterns in the musical tastes of listeners - diverse in terms of country and gender. Just like Spotify or other streaming services recommendation system. Firstly, we need to load libraries:
library(dplyr)
library(arules)
library(arulesViz)
In this paper, I am using database from Kaggle- free and open source of data sets. We have 15 000 users and their favourite artists - maximum number of artists for one user is 76. Let’s load the data:
artists <- read.transactions("data.csv", format = "basket", sep = ";", skip = 1)
And short summary of database:
summary(artists)
## transactions as itemMatrix in sparse format with
## 15000 rows (elements/itemsets/transactions) and
## 1004 columns (items) and a density of 0.01925319
##
## most frequent items:
## radiohead the beatles coldplay
## 2704 2668 2378
## red hot chili peppers muse (Other)
## 1786 1711 278706
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 185 222 280 302 359 385 472 461 491 501 504 482 472 471 479 477 456 455 444 455
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 436 478 426 438 408 446 417 375 348 340 316 293 274 286 238 208 193 181 128 102
## 41 42 43 44 45 46 47 48 49 50 51 52 54 55 63 76
## 93 61 55 36 23 15 6 11 2 1 5 3 1 2 1 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 11.00 19.00 19.33 27.00 76.00
##
## includes extended item information - examples:
## labels
## 1 ...and you will know us by the trail of dead
## 2 [unknown]
## 3 2pac
We can also look at first five observations of our data set:
LIST(head(artists, 5))
## [[1]]
## [1] "dropkick murphys" "edguy"
## [3] "eluveitie" "goldfrapp"
## [5] "guano apes" "jack johnson"
## [7] "john mayer" "judas priest"
## [9] "le tigre" "red hot chili peppers"
## [11] "rob zombie" "schandmaul"
## [13] "the black dahlia murder" "the killers"
## [15] "the rolling stones" "the who"
##
## [[2]]
## [1] "aesop rock" "air" "amon tobin"
## [4] "animal collective" "aphex twin" "arcade fire"
## [7] "atmosphere" "autechre" "beastie boys"
## [10] "boards of canada" "broken social scene" "cocorosie"
## [13] "devendra banhart" "four tet" "goldfrapp"
## [16] "joanna newsom" "m83" "massive attack"
## [19] "max richter" "mf doom" "neutral milk hotel"
## [22] "pavement" "plaid" "portishead"
## [25] "prefuse 73" "radiohead" "sage francis"
## [28] "the books" "the flashbulb"
##
## [[3]]
## [1] "a tribe called quest" "air"
## [3] "battles" "beck"
## [5] "bon iver" "bonobo"
## [7] "dj shadow" "fleetwood mac"
## [9] "flight of the conchords" "kyuss"
## [11] "late of the pier" "led zeppelin"
## [13] "mgmt" "michael jackson"
## [15] "muse" "pink floyd"
## [17] "rjd2" "röyksopp"
## [19] "simian mobile disco" "snow patrol"
## [21] "the cinematic orchestra" "the decemberists"
## [23] "the flaming lips" "the prodigy"
## [25] "the rolling stones" "tool"
## [27] "tv on the radio"
##
## [[4]]
## [1] "ac/dc" "bob marley & the wailers"
## [3] "children of bodom" "dream theater"
## [5] "iron maiden" "megadeth"
## [7] "metallica" "nightwish"
## [9] "sublime" "trivium"
## [11] "volbeat"
##
## [[5]]
## [1] "depeche mode" "dream theater" "faith no more"
## [4] "green day" "iron maiden" "jay-z"
## [7] "justin timberlake" "kanye west" "lily allen"
## [10] "manu chao" "metallica" "muse"
## [13] "pearl jam" "pink floyd" "queen"
## [16] "sigur rós" "snow patrol" "stevie wonder"
## [19] "tenacious d" "the streets" "thievery corporation"
## [22] "type o negative" "u2"
And generate bar plot with twenty the most popular artists and their frequency in database:
itemFrequencyPlot(artists, topN = 20, type = "absolute",
main = "Artists frequency", cex.names = 0.80)
Or generate table with ten the least popular artists and their frequency in database:
theLeastPopular <- setNames(as.data.frame(itemFrequency(artists, type = "absolute")),
c("Frequency"))
head(theLeastPopular %>% arrange(Frequency), 10)
## Frequency
## mary j. blige 96
## thin lizzy 96
## schandmaul 98
## toto 98
## böhse onkelz 99
## mae 99
## nelly 99
## tarja turunen 99
## the brian jonestown massacre 99
## the raveonettes 99
Difference can be easily seen - 2700 versus 96 in playlists of particular users.
Moreover, we can link the coexistence of two products and look which artists are selected together most often:
twoDimTab <- crossTable(artists, measure = "count", sort = TRUE)
twoDimTab[1:10, 1:10]
## radiohead the beatles coldplay red hot chili peppers muse
## radiohead 2704 873 819 474 645
## the beatles 873 2668 665 508 408
## coldplay 819 665 2378 579 582
## red hot chili peppers 474 508 579 1786 406
## muse 645 408 582 406 1711
## metallica 263 336 261 402 261
## pink floyd 514 595 338 310 277
## linkin park 204 180 386 361 301
## nirvana 458 444 351 446 323
## the killers 431 352 616 320 446
## metallica pink floyd linkin park nirvana the killers
## radiohead 263 514 204 458 431
## the beatles 336 595 180 444 352
## coldplay 261 338 386 351 616
## red hot chili peppers 402 310 361 446 320
## muse 261 277 301 323 446
## metallica 1670 335 291 379 162
## pink floyd 335 1574 110 305 180
## linkin park 291 110 1473 241 263
## nirvana 379 305 241 1473 228
## the killers 162 180 263 228 1473
This can give us an initial picture of how musical tastes are shaped.
During analysis, we have two algorithms to use - Eclat and Apriori. Generally, both give us the same results in slightly different ways. The Eclat algorithm does not create rules - firstly, it searches frequent sets to limit the data set thanks to the eclat() function. After this, we can create rule use the ruleIndication() function. The apriori() function does these two things together, by only one command. Because of large data set, I selected the eclat() function cause it is faster than the apriori() one.
So now we want to limit the data set, setting the minimum support for the set as 0.01 and maximum length of set as 15:
freq.items <- eclat(artists, parameter = list(supp = 0.01, maxlen = 15))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.01 1 15 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 150
##
## create itemset ...
## set transactions ...[1004 item(s), 15000 transaction(s)] done [0.05s].
## sorting and recoding items ... [655 item(s)] done [0.01s].
## creating sparse bit matrix ... [655 row(s), 15000 column(s)] done [0.00s].
## writing ... [1689 set(s)] done [0.51s].
## Creating S4 object ... done [0.00s].
inspect(head(freq.items, 10))
## items support transIdenticalToItemsets count
## [1] {kanye west,t.i.} 0.01040000 156 156
## [2] {rihanna,the pussycat dolls} 0.01040000 156 156
## [3] {kanye west,nas} 0.01040000 156 156
## [4] {jay-z,nas} 0.01060000 159 159
## [5] {metallica,motörhead} 0.01013333 152 152
## [6] {kylie minogue,madonna} 0.01093333 164 164
## [7] {coldplay,the fray} 0.01126667 169 169
## [8] {eric clapton,the beatles} 0.01040000 156 156
## [9] {nightwish,sonata arctica} 0.01346667 202 202
## [10] {andrew bird,radiohead} 0.01120000 168 168
And create the rules, setting minimum confidence as 0.5 so the strength of the rules is in range 50%-100%:
freq.rules <- ruleInduction(freq.items, artists, confidence = 0.5)
freq.rules
## set of 50 rules
inspect(head(freq.rules, 5))
## lhs rhs support confidence lift
## [1] {t.i.} => {kanye west} 0.01040000 0.5672727 8.854413
## [2] {the pussycat dolls} => {rihanna} 0.01040000 0.5777778 13.415893
## [3] {the fray} => {coldplay} 0.01126667 0.5168196 3.260006
## [4] {sonata arctica} => {nightwish} 0.01346667 0.5101010 8.236292
## [5] {judas priest} => {iron maiden} 0.01353333 0.5075000 8.562992
## itemset
## [1] 1
## [2] 2
## [3] 7
## [4] 9
## [5] 12
If we want to analyze the table above, ‘lhs’ and ‘rhs’ columns are about choice - if I decided to listen to ‘lhs’ then I would listen to ‘rhs’. Such a situation happens in 1.04% of all situations (‘support’ column) and the probability of appearing T.I. together with Kanye West on the list of favourite artists is 56.73% (‘confidence’ column). Column ‘lift’ suggests us that these two artists together on one list are 8.85 times more frequent than appearing separately.
Moreover, we can check significance of rules with Fisher’s test:
is.significant(freq.rules, artists)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [46] TRUE TRUE TRUE TRUE TRUE
We have four kinds of measures in association rules - Support, Confidence, Expected Confidence and Lift. The higher value they have, the better for our analysis. Below all four formulas:
Support = \(\frac{Number~of~transactions~with~both~A~and~B}{Total~number~of~transactions}\)
Confidence = \(\frac{Number~of~transactions~with~both~A~and~B}{Total~number~of~transactions~with~A}\)
Expected Confidence = \(\frac{Number~of~transactions~with~B}{Total~number~of~transactions}\)
Lift = \(\frac{Confidence}{Expected~Confidence}\)
And rules sorted by given measures:
rules.by.supp <- sort(freq.rules, by = "support", decreasing = TRUE)
inspect(head(rules.by.supp, 5))
## lhs rhs support confidence lift itemset
## [1] {beck} => {radiohead} 0.02926667 0.5092807 2.825152 614
## [2] {snow patrol} => {coldplay} 0.02646667 0.5251323 3.312441 426
## [3] {keane} => {coldplay} 0.02226667 0.6374046 4.020634 156
## [4] {blur} => {radiohead} 0.01753333 0.5228628 2.900496 152
## [5] {megadeth} => {metallica} 0.01626667 0.5281385 4.743759 52
rules.by.conf <- sort(freq.rules, by = "confidence", decreasing = TRUE)
inspect(head(rules.by.conf, 5))
## lhs rhs support confidence
## [1] {oasis,the killers} => {coldplay} 0.01113333 0.6626984
## [2] {sigur rós,the beatles} => {radiohead} 0.01046667 0.6434426
## [3] {keane} => {coldplay} 0.02226667 0.6374046
## [4] {radiohead,snow patrol} => {coldplay} 0.01006667 0.6344538
## [5] {coldplay,the smashing pumpkins} => {radiohead} 0.01093333 0.6283525
## lift itemset
## [1] 4.180183 598
## [2] 3.569393 755
## [3] 4.020634 156
## [4] 4.002021 423
## [5] 3.485683 719
rules.by.lift <- sort(freq.rules, by = "lift", decreasing = TRUE)
inspect(head(rules.by.lift, 5))
## lhs rhs support confidence lift
## [1] {the pussycat dolls} => {rihanna} 0.01040000 0.5777778 13.415893
## [2] {t.i.} => {kanye west} 0.01040000 0.5672727 8.854413
## [3] {judas priest} => {iron maiden} 0.01353333 0.5075000 8.562992
## [4] {sonata arctica} => {nightwish} 0.01346667 0.5101010 8.236292
## [5] {pink floyd,the doors} => {led zeppelin} 0.01066667 0.5387205 6.802027
## itemset
## [1] 2
## [2] 1
## [3] 12
## [4] 9
## [5] 466
Now we can plot rules - we use support and confidence as measures and lift as shading so the plot shows the relationship between all three parameters. Association rules are stronger when the rectangles are redder.
plot(freq.rules, method = "matrix", measure = c("support", "confidence"), shading = "lift")
Next plot shows the dominant elements of the rules - it’s grouped matrix for all 50 rules.
plot(freq.rules, method = "grouped")
Because of such a high number of rules, graph which is a network illustration of the rules is easier to analyze:
plot(freq.rules, method = "graph", shading = "lift")
Top rules without sorting by a given measure gives us quite diverse genres of music - pop, rap, rock or metal. If we use Confidence, rules are more focused on rock artists who are by far the most frequent on users playlists. That’s why I want to focus on the differences between likes of users who prefer this kind of music - how much different bands do they choose, despite liking the same genre? I want to analyze Top 5 - Radiohead, The Beatles, Coldplay, Red Hot Chili Peppers and Muse.
Firstly, I want to create rules for Radiohead with Confidence:
radiohead <- apriori(data = artists, parameter = list(supp = 0.01, conf = 0.5,
target = "rules"),
appearance = list(default = "lhs", rhs = "radiohead"),
control = list(verbose = FALSE))
radiohead.by.conf <- sort(radiohead, by = "confidence", decreasing = TRUE)
inspect(head(radiohead.by.conf, 5))
## lhs rhs support confidence
## [1] {sigur rós,the beatles} => {radiohead} 0.01046667 0.6434426
## [2] {coldplay,the smashing pumpkins} => {radiohead} 0.01093333 0.6283525
## [3] {the beatles,the smashing pumpkins} => {radiohead} 0.01146667 0.6209386
## [4] {beck,the beatles} => {radiohead} 0.01300000 0.5909091
## [5] {coldplay,sigur rós} => {radiohead} 0.01206667 0.5801282
## coverage lift count
## [1] 0.01626667 3.569393 157
## [2] 0.01740000 3.485683 164
## [3] 0.01846667 3.444556 172
## [4] 0.02200000 3.277972 195
## [5] 0.02080000 3.218167 181
And generate parallel coordinates plot for all oh them:
plot(radiohead.by.conf, method = "paracoord", control = list(reorder = TRUE))
This graph shows the complexity of rules containing a specific product. The arrows in the chart above represent the basket length for a given artist.
To better visualization we can also use a network illustration:
plot(radiohead.by.conf, method = "graph", shading = "lift")
The same actions for The Beatles:
the_beatles <- apriori(data = artists, parameter = list(supp = 0.01, conf = 0.5,
target = "rules"),
appearance = list(default = "lhs", rhs = "the beatles"),
control = list(verbose = FALSE))
the_beatles.by.conf <- sort(the_beatles, by = "confidence", decreasing = TRUE)
inspect(head(the_beatles.by.conf, 5))
## lhs rhs support confidence
## [1] {bob dylan,pink floyd} => {the beatles} 0.01033333 0.6150794
## [2] {bob dylan,the rolling stones} => {the beatles} 0.01146667 0.5910653
## [3] {led zeppelin,the rolling stones} => {the beatles} 0.01066667 0.5776173
## [4] {david bowie,pink floyd} => {the beatles} 0.01006667 0.5741445
## [5] {bob dylan,radiohead} => {the beatles} 0.01386667 0.5730028
## coverage lift count
## [1] 0.01680000 3.458092 155
## [2] 0.01940000 3.323081 172
## [3] 0.01846667 3.247474 160
## [4] 0.01753333 3.227949 151
## [5] 0.02420000 3.221530 208
And plots for 12 rules:
plot(the_beatles.by.conf, method = "paracoord", control = list(reorder = TRUE))
plot(the_beatles.by.conf, method = "graph", shading = "lift")
The same actions for Coldplay:
coldplay <- apriori(data = artists, parameter = list(supp = 0.01,
conf = 0.5, target = "rules"),
appearance = list(default = "lhs", rhs = "coldplay"),
control = list(verbose = FALSE))
coldplay.by.conf <- sort(coldplay, by = "confidence", decreasing = TRUE)
inspect(head(coldplay.by.conf, 5))
## lhs rhs support confidence
## [1] {oasis,the killers} => {coldplay} 0.01113333 0.6626984
## [2] {keane} => {coldplay} 0.02226667 0.6374046
## [3] {radiohead,snow patrol} => {coldplay} 0.01006667 0.6344538
## [4] {snow patrol,the killers} => {coldplay} 0.01040000 0.5954198
## [5] {death cab for cutie,the killers} => {coldplay} 0.01086667 0.5884477
## coverage lift count
## [1] 0.01680000 4.180183 167
## [2] 0.03493333 4.020634 334
## [3] 0.01586667 4.002021 151
## [4] 0.01746667 3.755802 156
## [5] 0.01846667 3.711823 163
And plots for 16 rules:
plot(coldplay.by.conf, method = "paracoord", control = list(reorder = TRUE))
plot(coldplay.by.conf, method = "graph", shading = "lift")
The same actions for Red Hot Chili Peppers:
rhcp <- apriori(data = artists, parameter = list(supp = 0.01,
conf = 0.5, target = "rules"),
appearance = list(default = "lhs", rhs = "red hot chili peppers"),
control = list(verbose = FALSE))
rhcp.by.conf <- sort(rhcp, by = "confidence", decreasing = TRUE)
inspect(head(rhcp.by.conf, 5))
Support with value=0.01 gives us 0 rules so we can try with value=0.005:
rhcp2 <- apriori(data = artists, parameter = list(supp = 0.005,
conf = 0.5, target = "rules"),
appearance = list(default = "lhs", rhs = "red hot chili peppers"),
control = list(verbose = FALSE))
rhcp.by.conf2 <- sort(rhcp2, by = "confidence", decreasing = TRUE)
inspect(head(rhcp.by.conf2, 5))
## lhs rhs support confidence
## [1] {foo fighters,led zeppelin} => {red hot chili peppers} 0.0058 0.5209581
## [2] {audioslave,foo fighters} => {red hot chili peppers} 0.0054 0.5192308
## coverage lift count
## [1] 0.01113333 4.375348 87
## [2] 0.01040000 4.360841 81
And plots for 2 rules:
plot(rhcp.by.conf2, method = "paracoord", control = list(reorder = TRUE))
plot(rhcp.by.conf2, method = "graph", shading = "lift")
The same actions for Muse:
muse <- apriori(data = artists, parameter = list(supp = 0.01,
conf = 0.5, target = "rules"),
appearance = list(default = "lhs", rhs = "muse"),
control = list(verbose = FALSE))
muse.by.conf <- sort(muse, by = "confidence", decreasing = TRUE)
inspect(head(muse.by.conf, 5))
## lhs rhs support confidence coverage lift count
## [1] {placebo,radiohead} => {muse} 0.01366667 0.5137845 0.0266 4.504247 205
And plots for only 1 rule:
plot(muse.by.conf, method = "paracoord", control = list(reorder = TRUE))
plot(muse.by.conf, method = "graph", shading = "lift")
As I expected, each artist from Top 5 is connected with minimum one other artist from Top 5 on users playlists. What’s interesting are the other artists - apart from having Top 5 artists, these singers/bands attracted completely different individuals. For example - for Radiohead we have rules with The Smashing Pumpkins, Beck or Sigur Rós and for The Beatles we have Bob Dylan, David Bowie Pink Floyd or Led Zeppelin.
My paper is not a typical paper about market basket analysis because of analyzing music market. It can be useful during recommendations in streaming services like Spotify or YouTube - especially these differences between given artists like between Top 5. Also it could be useful for stores with CDs and other music stuff. We can analyze where we should put some artists to sell more articles. Meanwhile, discovering patterns in the behavior of users or customers is the future of e-commerce and all websites - after all, it is used by the biggest, such as Instagram, Netflix or YouTube.