1 Introduction

Association Rules and Market Basket Analysis are often used by shops - this part of unsupervised learning helps sellers know patterns and bindings of clients behaviors. Thanks to them sellers are able to predict which products are linked - if somebody buys X then he or she tends to buy Y. I want to use it to analyze patterns in the musical tastes of listeners - diverse in terms of country and gender. Just like Spotify or other streaming services recommendation system. Firstly, we need to load libraries:

library(dplyr)
library(arules)
library(arulesViz)

2 Data

In this paper, I am using database from Kaggle- free and open source of data sets. We have 15 000 users and their favourite artists - maximum number of artists for one user is 76. Let’s load the data:

artists <- read.transactions("data.csv", format = "basket", sep = ";", skip = 1)

And short summary of database:

summary(artists)
## transactions as itemMatrix in sparse format with
##  15000 rows (elements/itemsets/transactions) and
##  1004 columns (items) and a density of 0.01925319 
## 
## most frequent items:
##             radiohead           the beatles              coldplay 
##                  2704                  2668                  2378 
## red hot chili peppers                  muse               (Other) 
##                  1786                  1711                278706 
## 
## element (itemset/transaction) length distribution:
## sizes
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
## 185 222 280 302 359 385 472 461 491 501 504 482 472 471 479 477 456 455 444 455 
##  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
## 436 478 426 438 408 446 417 375 348 340 316 293 274 286 238 208 193 181 128 102 
##  41  42  43  44  45  46  47  48  49  50  51  52  54  55  63  76 
##  93  61  55  36  23  15   6  11   2   1   5   3   1   2   1   1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   11.00   19.00   19.33   27.00   76.00 
## 
## includes extended item information - examples:
##                                         labels
## 1 ...and you will know us by the trail of dead
## 2                                    [unknown]
## 3                                         2pac

We can also look at first five observations of our data set:

LIST(head(artists, 5))
## [[1]]
##  [1] "dropkick murphys"        "edguy"                  
##  [3] "eluveitie"               "goldfrapp"              
##  [5] "guano apes"              "jack johnson"           
##  [7] "john mayer"              "judas priest"           
##  [9] "le tigre"                "red hot chili peppers"  
## [11] "rob zombie"              "schandmaul"             
## [13] "the black dahlia murder" "the killers"            
## [15] "the rolling stones"      "the who"                
## 
## [[2]]
##  [1] "aesop rock"          "air"                 "amon tobin"         
##  [4] "animal collective"   "aphex twin"          "arcade fire"        
##  [7] "atmosphere"          "autechre"            "beastie boys"       
## [10] "boards of canada"    "broken social scene" "cocorosie"          
## [13] "devendra banhart"    "four tet"            "goldfrapp"          
## [16] "joanna newsom"       "m83"                 "massive attack"     
## [19] "max richter"         "mf doom"             "neutral milk hotel" 
## [22] "pavement"            "plaid"               "portishead"         
## [25] "prefuse 73"          "radiohead"           "sage francis"       
## [28] "the books"           "the flashbulb"      
## 
## [[3]]
##  [1] "a tribe called quest"    "air"                    
##  [3] "battles"                 "beck"                   
##  [5] "bon iver"                "bonobo"                 
##  [7] "dj shadow"               "fleetwood mac"          
##  [9] "flight of the conchords" "kyuss"                  
## [11] "late of the pier"        "led zeppelin"           
## [13] "mgmt"                    "michael jackson"        
## [15] "muse"                    "pink floyd"             
## [17] "rjd2"                    "röyksopp"               
## [19] "simian mobile disco"     "snow patrol"            
## [21] "the cinematic orchestra" "the decemberists"       
## [23] "the flaming lips"        "the prodigy"            
## [25] "the rolling stones"      "tool"                   
## [27] "tv on the radio"        
## 
## [[4]]
##  [1] "ac/dc"                    "bob marley & the wailers"
##  [3] "children of bodom"        "dream theater"           
##  [5] "iron maiden"              "megadeth"                
##  [7] "metallica"                "nightwish"               
##  [9] "sublime"                  "trivium"                 
## [11] "volbeat"                 
## 
## [[5]]
##  [1] "depeche mode"         "dream theater"        "faith no more"       
##  [4] "green day"            "iron maiden"          "jay-z"               
##  [7] "justin timberlake"    "kanye west"           "lily allen"          
## [10] "manu chao"            "metallica"            "muse"                
## [13] "pearl jam"            "pink floyd"           "queen"               
## [16] "sigur rós"            "snow patrol"          "stevie wonder"       
## [19] "tenacious d"          "the streets"          "thievery corporation"
## [22] "type o negative"      "u2"

And generate bar plot with twenty the most popular artists and their frequency in database:

itemFrequencyPlot(artists, topN = 20, type = "absolute", 
                  main = "Artists frequency", cex.names = 0.80)


Or generate table with ten the least popular artists and their frequency in database:

theLeastPopular <- setNames(as.data.frame(itemFrequency(artists, type = "absolute")), 
                            c("Frequency"))
head(theLeastPopular %>% arrange(Frequency), 10)
##                              Frequency
## mary j. blige                       96
## thin lizzy                          96
## schandmaul                          98
## toto                                98
## böhse onkelz                        99
## mae                                 99
## nelly                               99
## tarja turunen                       99
## the brian jonestown massacre        99
## the raveonettes                     99

Difference can be easily seen - 2700 versus 96 in playlists of particular users.
Moreover, we can link the coexistence of two products and look which artists are selected together most often:

twoDimTab <- crossTable(artists, measure = "count", sort = TRUE)
twoDimTab[1:10, 1:10]
##                       radiohead the beatles coldplay red hot chili peppers muse
## radiohead                  2704         873      819                   474  645
## the beatles                 873        2668      665                   508  408
## coldplay                    819         665     2378                   579  582
## red hot chili peppers       474         508      579                  1786  406
## muse                        645         408      582                   406 1711
## metallica                   263         336      261                   402  261
## pink floyd                  514         595      338                   310  277
## linkin park                 204         180      386                   361  301
## nirvana                     458         444      351                   446  323
## the killers                 431         352      616                   320  446
##                       metallica pink floyd linkin park nirvana the killers
## radiohead                   263        514         204     458         431
## the beatles                 336        595         180     444         352
## coldplay                    261        338         386     351         616
## red hot chili peppers       402        310         361     446         320
## muse                        261        277         301     323         446
## metallica                  1670        335         291     379         162
## pink floyd                  335       1574         110     305         180
## linkin park                 291        110        1473     241         263
## nirvana                     379        305         241    1473         228
## the killers                 162        180         263     228        1473

This can give us an initial picture of how musical tastes are shaped.

3 Association rules - all together

3.1 Eclat or Apriori algorithm?

During analysis, we have two algorithms to use - Eclat and Apriori. Generally, both give us the same results in slightly different ways. The Eclat algorithm does not create rules - firstly, it searches frequent sets to limit the data set thanks to the eclat() function. After this, we can create rule use the ruleIndication() function. The apriori() function does these two things together, by only one command. Because of large data set, I selected the eclat() function cause it is faster than the apriori() one.
So now we want to limit the data set, setting the minimum support for the set as 0.01 and maximum length of set as 15:

freq.items <- eclat(artists, parameter = list(supp = 0.01, maxlen = 15))
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE    0.01      1     15 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 150 
## 
## create itemset ... 
## set transactions ...[1004 item(s), 15000 transaction(s)] done [0.05s].
## sorting and recoding items ... [655 item(s)] done [0.01s].
## creating sparse bit matrix ... [655 row(s), 15000 column(s)] done [0.00s].
## writing  ... [1689 set(s)] done [0.51s].
## Creating S4 object  ... done [0.00s].
inspect(head(freq.items, 10))
##      items                        support    transIdenticalToItemsets count
## [1]  {kanye west,t.i.}            0.01040000 156                      156  
## [2]  {rihanna,the pussycat dolls} 0.01040000 156                      156  
## [3]  {kanye west,nas}             0.01040000 156                      156  
## [4]  {jay-z,nas}                  0.01060000 159                      159  
## [5]  {metallica,motörhead}        0.01013333 152                      152  
## [6]  {kylie minogue,madonna}      0.01093333 164                      164  
## [7]  {coldplay,the fray}          0.01126667 169                      169  
## [8]  {eric clapton,the beatles}   0.01040000 156                      156  
## [9]  {nightwish,sonata arctica}   0.01346667 202                      202  
## [10] {andrew bird,radiohead}      0.01120000 168                      168

And create the rules, setting minimum confidence as 0.5 so the strength of the rules is in range 50%-100%:

freq.rules <- ruleInduction(freq.items, artists, confidence = 0.5) 
freq.rules
## set of 50 rules
inspect(head(freq.rules, 5))
##     lhs                     rhs           support    confidence lift     
## [1] {t.i.}               => {kanye west}  0.01040000 0.5672727   8.854413
## [2] {the pussycat dolls} => {rihanna}     0.01040000 0.5777778  13.415893
## [3] {the fray}           => {coldplay}    0.01126667 0.5168196   3.260006
## [4] {sonata arctica}     => {nightwish}   0.01346667 0.5101010   8.236292
## [5] {judas priest}       => {iron maiden} 0.01353333 0.5075000   8.562992
##     itemset
## [1]  1     
## [2]  2     
## [3]  7     
## [4]  9     
## [5] 12

If we want to analyze the table above, ‘lhs’ and ‘rhs’ columns are about choice - if I decided to listen to ‘lhs’ then I would listen to ‘rhs’. Such a situation happens in 1.04% of all situations (‘support’ column) and the probability of appearing T.I. together with Kanye West on the list of favourite artists is 56.73% (‘confidence’ column). Column ‘lift’ suggests us that these two artists together on one list are 8.85 times more frequent than appearing separately.
Moreover, we can check significance of rules with Fisher’s test:

is.significant(freq.rules, artists)
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [46] TRUE TRUE TRUE TRUE TRUE

3.2 Different measures

We have four kinds of measures in association rules - Support, Confidence, Expected Confidence and Lift. The higher value they have, the better for our analysis. Below all four formulas:

Support = \(\frac{Number~of~transactions~with~both~A~and~B}{Total~number~of~transactions}\)

Confidence = \(\frac{Number~of~transactions~with~both~A~and~B}{Total~number~of~transactions~with~A}\)

Expected Confidence = \(\frac{Number~of~transactions~with~B}{Total~number~of~transactions}\)

Lift = \(\frac{Confidence}{Expected~Confidence}\)

And rules sorted by given measures:

rules.by.supp <- sort(freq.rules, by = "support", decreasing = TRUE)
inspect(head(rules.by.supp, 5))
##     lhs              rhs         support    confidence lift     itemset
## [1] {beck}        => {radiohead} 0.02926667 0.5092807  2.825152 614    
## [2] {snow patrol} => {coldplay}  0.02646667 0.5251323  3.312441 426    
## [3] {keane}       => {coldplay}  0.02226667 0.6374046  4.020634 156    
## [4] {blur}        => {radiohead} 0.01753333 0.5228628  2.900496 152    
## [5] {megadeth}    => {metallica} 0.01626667 0.5281385  4.743759  52
rules.by.conf <- sort(freq.rules, by = "confidence", decreasing = TRUE)
inspect(head(rules.by.conf, 5))
##     lhs                                 rhs         support    confidence
## [1] {oasis,the killers}              => {coldplay}  0.01113333 0.6626984 
## [2] {sigur rós,the beatles}          => {radiohead} 0.01046667 0.6434426 
## [3] {keane}                          => {coldplay}  0.02226667 0.6374046 
## [4] {radiohead,snow patrol}          => {coldplay}  0.01006667 0.6344538 
## [5] {coldplay,the smashing pumpkins} => {radiohead} 0.01093333 0.6283525 
##     lift     itemset
## [1] 4.180183 598    
## [2] 3.569393 755    
## [3] 4.020634 156    
## [4] 4.002021 423    
## [5] 3.485683 719
rules.by.lift <- sort(freq.rules, by = "lift", decreasing = TRUE)
inspect(head(rules.by.lift, 5))
##     lhs                       rhs            support    confidence lift     
## [1] {the pussycat dolls}   => {rihanna}      0.01040000 0.5777778  13.415893
## [2] {t.i.}                 => {kanye west}   0.01040000 0.5672727   8.854413
## [3] {judas priest}         => {iron maiden}  0.01353333 0.5075000   8.562992
## [4] {sonata arctica}       => {nightwish}    0.01346667 0.5101010   8.236292
## [5] {pink floyd,the doors} => {led zeppelin} 0.01066667 0.5387205   6.802027
##     itemset
## [1]   2    
## [2]   1    
## [3]  12    
## [4]   9    
## [5] 466

3.3 Visualization

Now we can plot rules - we use support and confidence as measures and lift as shading so the plot shows the relationship between all three parameters. Association rules are stronger when the rectangles are redder.

plot(freq.rules, method = "matrix", measure = c("support", "confidence"), shading = "lift")


Next plot shows the dominant elements of the rules - it’s grouped matrix for all 50 rules.

plot(freq.rules, method = "grouped")


Because of such a high number of rules, graph which is a network illustration of the rules is easier to analyze:

plot(freq.rules, method = "graph", shading = "lift")


4 Individual artists analysis

Top rules without sorting by a given measure gives us quite diverse genres of music - pop, rap, rock or metal. If we use Confidence, rules are more focused on rock artists who are by far the most frequent on users playlists. That’s why I want to focus on the differences between likes of users who prefer this kind of music - how much different bands do they choose, despite liking the same genre? I want to analyze Top 5 - Radiohead, The Beatles, Coldplay, Red Hot Chili Peppers and Muse.

4.1 Radiohead

Firstly, I want to create rules for Radiohead with Confidence:

radiohead <- apriori(data = artists,  parameter = list(supp = 0.01, conf = 0.5, 
                                                       target = "rules"), 
                     appearance = list(default = "lhs", rhs = "radiohead"), 
                     control = list(verbose = FALSE))
radiohead.by.conf <- sort(radiohead, by = "confidence", decreasing = TRUE)
inspect(head(radiohead.by.conf, 5))
##     lhs                                    rhs         support    confidence
## [1] {sigur rós,the beatles}             => {radiohead} 0.01046667 0.6434426 
## [2] {coldplay,the smashing pumpkins}    => {radiohead} 0.01093333 0.6283525 
## [3] {the beatles,the smashing pumpkins} => {radiohead} 0.01146667 0.6209386 
## [4] {beck,the beatles}                  => {radiohead} 0.01300000 0.5909091 
## [5] {coldplay,sigur rós}                => {radiohead} 0.01206667 0.5801282 
##     coverage   lift     count
## [1] 0.01626667 3.569393 157  
## [2] 0.01740000 3.485683 164  
## [3] 0.01846667 3.444556 172  
## [4] 0.02200000 3.277972 195  
## [5] 0.02080000 3.218167 181

And generate parallel coordinates plot for all oh them:

plot(radiohead.by.conf, method = "paracoord", control = list(reorder = TRUE))


This graph shows the complexity of rules containing a specific product. The arrows in the chart above represent the basket length for a given artist.
To better visualization we can also use a network illustration:

plot(radiohead.by.conf, method = "graph", shading = "lift")


4.2 The Beatles

The same actions for The Beatles:

the_beatles <- apriori(data = artists,  parameter = list(supp = 0.01, conf = 0.5, 
                                                         target = "rules"), 
                       appearance = list(default = "lhs", rhs = "the beatles"), 
                       control = list(verbose = FALSE))
the_beatles.by.conf <- sort(the_beatles, by = "confidence", decreasing = TRUE)
inspect(head(the_beatles.by.conf, 5))
##     lhs                                  rhs           support    confidence
## [1] {bob dylan,pink floyd}            => {the beatles} 0.01033333 0.6150794 
## [2] {bob dylan,the rolling stones}    => {the beatles} 0.01146667 0.5910653 
## [3] {led zeppelin,the rolling stones} => {the beatles} 0.01066667 0.5776173 
## [4] {david bowie,pink floyd}          => {the beatles} 0.01006667 0.5741445 
## [5] {bob dylan,radiohead}             => {the beatles} 0.01386667 0.5730028 
##     coverage   lift     count
## [1] 0.01680000 3.458092 155  
## [2] 0.01940000 3.323081 172  
## [3] 0.01846667 3.247474 160  
## [4] 0.01753333 3.227949 151  
## [5] 0.02420000 3.221530 208

And plots for 12 rules:

plot(the_beatles.by.conf, method = "paracoord", control = list(reorder = TRUE))


plot(the_beatles.by.conf, method = "graph", shading = "lift")


4.3 Coldplay

The same actions for Coldplay:

coldplay <- apriori(data = artists,  parameter = list(supp = 0.01, 
                                                      conf = 0.5, target = "rules"), 
                    appearance = list(default = "lhs", rhs = "coldplay"), 
                    control = list(verbose = FALSE))
coldplay.by.conf <- sort(coldplay, by = "confidence", decreasing = TRUE)
inspect(head(coldplay.by.conf, 5))
##     lhs                                  rhs        support    confidence
## [1] {oasis,the killers}               => {coldplay} 0.01113333 0.6626984 
## [2] {keane}                           => {coldplay} 0.02226667 0.6374046 
## [3] {radiohead,snow patrol}           => {coldplay} 0.01006667 0.6344538 
## [4] {snow patrol,the killers}         => {coldplay} 0.01040000 0.5954198 
## [5] {death cab for cutie,the killers} => {coldplay} 0.01086667 0.5884477 
##     coverage   lift     count
## [1] 0.01680000 4.180183 167  
## [2] 0.03493333 4.020634 334  
## [3] 0.01586667 4.002021 151  
## [4] 0.01746667 3.755802 156  
## [5] 0.01846667 3.711823 163

And plots for 16 rules:

plot(coldplay.by.conf, method = "paracoord", control = list(reorder = TRUE))


plot(coldplay.by.conf, method = "graph", shading = "lift")


4.4 Red Hot Chili Peppers

The same actions for Red Hot Chili Peppers:

rhcp <- apriori(data = artists,  parameter = list(supp = 0.01, 
                                                  conf = 0.5, target = "rules"), 
                appearance = list(default = "lhs", rhs = "red hot chili peppers"), 
                control = list(verbose = FALSE))
rhcp.by.conf <- sort(rhcp, by = "confidence", decreasing = TRUE)
inspect(head(rhcp.by.conf, 5))

Support with value=0.01 gives us 0 rules so we can try with value=0.005:

rhcp2 <- apriori(data = artists,  parameter = list(supp = 0.005, 
                                                   conf = 0.5, target = "rules"), 
                 appearance = list(default = "lhs", rhs = "red hot chili peppers"), 
                 control = list(verbose = FALSE))
rhcp.by.conf2 <- sort(rhcp2, by = "confidence", decreasing = TRUE)
inspect(head(rhcp.by.conf2, 5))
##     lhs                            rhs                     support confidence
## [1] {foo fighters,led zeppelin} => {red hot chili peppers} 0.0058  0.5209581 
## [2] {audioslave,foo fighters}   => {red hot chili peppers} 0.0054  0.5192308 
##     coverage   lift     count
## [1] 0.01113333 4.375348 87   
## [2] 0.01040000 4.360841 81

And plots for 2 rules:

plot(rhcp.by.conf2, method = "paracoord", control = list(reorder = TRUE))


plot(rhcp.by.conf2, method = "graph", shading = "lift")


4.5 Muse

The same actions for Muse:

muse <- apriori(data = artists,  parameter = list(supp = 0.01, 
                                                  conf = 0.5, target = "rules"), 
                appearance = list(default = "lhs", rhs = "muse"), 
                control = list(verbose = FALSE))
muse.by.conf <- sort(muse, by = "confidence", decreasing = TRUE)
inspect(head(muse.by.conf, 5))
##     lhs                    rhs    support    confidence coverage lift     count
## [1] {placebo,radiohead} => {muse} 0.01366667 0.5137845  0.0266   4.504247 205

And plots for only 1 rule:

plot(muse.by.conf, method = "paracoord", control = list(reorder = TRUE))


plot(muse.by.conf, method = "graph", shading = "lift")


4.6 Summary

As I expected, each artist from Top 5 is connected with minimum one other artist from Top 5 on users playlists. What’s interesting are the other artists - apart from having Top 5 artists, these singers/bands attracted completely different individuals. For example - for Radiohead we have rules with The Smashing Pumpkins, Beck or Sigur Rós and for The Beatles we have Bob Dylan, David Bowie Pink Floyd or Led Zeppelin.

5 Conclusions

My paper is not a typical paper about market basket analysis because of analyzing music market. It can be useful during recommendations in streaming services like Spotify or YouTube - especially these differences between given artists like between Top 5. Also it could be useful for stores with CDs and other music stuff. We can analyze where we should put some artists to sell more articles. Meanwhile, discovering patterns in the behavior of users or customers is the future of e-commerce and all websites - after all, it is used by the biggest, such as Instagram, Netflix or YouTube.