Introduction

Last.FM users listen to millions of songs every day, and many of them are likely to share tastes and enjoy tracks of similar genres or styles. In this paper I will use the Apriori algorithm to search for association rules among popular artists on that platform. I will show which artists are listened to most and which other musicians are likely to appear in a user's playlist when a given artist appears there. Such an algorithm could be used, for example, to recommend other bands to users or to suggest what to listen to next.

The dataset I will work with comes from Kaggle and can be found at https://www.kaggle.com/ravichaubey1506/lastfm

First, load the data and the necessary packages. The arules package is required for mining association rules, and data.table is a package for data tidying.

library(arules)
library(data.table)
data <- as.data.table(read.csv("lastfm.csv", encoding = "UTF-8"))
head(data)
##    user                  artist sex country
## 1:    1   red hot chili peppers   f Germany
## 2:    1 the black dahlia murder   f Germany
## 3:    1               goldfrapp   f Germany
## 4:    1        dropkick murphys   f Germany
## 5:    1                le tigre   f Germany
## 6:    1              schandmaul   f Germany

Every row corresponds to one artist a user has listened to, and every user has a unique ID. The data table also contains each user's sex and country, but we don't need them for now.

data <- data[, c(-3, -4)]

We have a data table, but the arules package needs an object of class transactions. We can, however, easily convert the data to the required format with the following command:

artists <- as(split(data[, artist], data[, user]), "transactions")
artists
## transactions in sparse format with
##  15000 transactions (rows) and
##  1004 items (columns)

There is information about 15,000 users in total who listened to 1,004 unique artists (or bands).
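
To make the conversion step more concrete, here is a minimal base R sketch of what split() does before the result is coerced to transactions. The toy table and its values are made up for illustration and do not come from lastfm.csv.

```r
# Toy long-format listening table (made-up values, not from lastfm.csv)
toy <- data.frame(
  user   = c(1, 1, 2, 2, 2, 3),
  artist = c("muse", "radiohead", "muse", "coldplay", "radiohead", "coldplay"),
  stringsAsFactors = FALSE
)

# split() groups the artist column by user ID, giving one character vector
# (one "basket") per user; the list names are the user IDs
baskets <- split(toy$artist, toy$user)
print(baskets)

# The number of baskets equals the number of distinct users
length(baskets)
```

A list of this shape, one vector of items per user, is what gets coerced into the sparse transactions object.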

Let's now see which items appear most frequently. The support of an item is the share of transactions (here, users) that contain it. We will plot the ten artists with the highest support. On a side note, it seems that the most popular genre among Last.FM users is rock music.

itemFrequencyPlot(artists, topN=10, type="relative", main="Item Frequency") 
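
As a rough illustration of what the plot shows, relative support can be computed by hand. The sketch below uses base R only and a handful of made-up baskets; the 0.5 threshold is an arbitrary value chosen for the example.

```r
# Made-up baskets: one character vector of artists per user
baskets <- list(
  c("muse", "radiohead"),
  c("muse", "coldplay", "radiohead"),
  c("coldplay"),
  c("radiohead", "the beatles")
)

# Relative support of an item = share of baskets that contain it
items <- sort(unique(unlist(baskets)))
item_support <- vapply(
  items,
  function(item) mean(vapply(baskets, function(b) item %in% b, logical(1))),
  numeric(1)
)
print(sort(item_support, decreasing = TRUE))

# Items whose support reaches an (arbitrary) threshold of 0.5
frequent <- names(item_support)[item_support >= 0.5]
print(frequent)
```

Here radiohead appears in three of the four baskets, so its support is 0.75, while the beatles appear in only one and fall below the threshold.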

Association rules

Let's now move on to mining association rules. I will use the Apriori algorithm, which searches for items that frequently occur together. It first filters out items that do not meet a predetermined support threshold. It then combines the remaining items into two-item sets and keeps those that satisfy the required support. From those frequent two-item sets it generates all possible rules that satisfy the minimum confidence. The same procedure is then repeated for frequent three-item sets, four-item sets and so on.
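
The first two passes of this level-wise search can be sketched in a few lines of base R. This is only an illustration on made-up baskets, not how arules implements the algorithm; the helper support_of and the threshold of 0.5 are assumptions made for the example.

```r
# Made-up baskets: one character vector of artists per user
baskets <- list(
  c("muse", "radiohead", "coldplay"),
  c("muse", "radiohead"),
  c("coldplay", "radiohead"),
  c("the beatles", "radiohead"),
  c("muse", "radiohead", "coldplay"),
  c("muse")
)
min_support <- 0.5

# Hypothetical helper: share of baskets containing every item of an itemset
support_of <- function(itemset) {
  mean(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

# Pass 1: keep only single items that reach the support threshold
items <- sort(unique(unlist(baskets)))
item_support <- vapply(items, support_of, numeric(1))
frequent_items <- items[item_support >= min_support]

# Pass 2: candidate pairs are built only from frequent single items,
# then filtered by the same support threshold
candidate_pairs <- combn(frequent_items, 2, simplify = FALSE)
pair_support <- vapply(candidate_pairs, support_of, numeric(1))
frequent_pairs <- candidate_pairs[pair_support >= min_support]

print(frequent_items)
print(frequent_pairs)
```

The beatles are dropped in the first pass, so no pair containing them is ever counted. The pair {coldplay, muse} is built from two frequent items but occurs in only two of six baskets, so it is dropped in the second pass.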

Confidence of a rule X => Y is the share of transactions containing X that also contain Y. It is calculated by dividing the number of transactions containing both X and Y by the number of transactions containing X.
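
As a small worked example of this formula, the sketch below computes confidence in both directions for two artists. The listening flags are made up for illustration; the point is that confidence is not symmetric.

```r
# Made-up listening flags for five users
listens <- data.frame(
  keane    = c(TRUE, TRUE, TRUE, FALSE, FALSE),
  coldplay = c(TRUE, TRUE, FALSE, TRUE, TRUE)
)

# Number of users who listened to both artists
n_both <- sum(listens$keane & listens$coldplay)

# confidence(X => Y) = count(X and Y) / count(X)
conf_keane_coldplay <- n_both / sum(listens$keane)     # 2 of 3 Keane listeners
conf_coldplay_keane <- n_both / sum(listens$coldplay)  # 2 of 4 Coldplay listeners

print(conf_keane_coldplay)
print(conf_coldplay_keane)
```

The two rules share the same support (2 of 5 users) but differ in confidence because their LHS items have different frequencies.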

fm_rules <- apriori(artists, parameter = list(support = 0.005, confidence = 0.15, minlen = 2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.15    0.1    1 none FALSE            TRUE       5   0.005      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 75 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[1004 item(s), 15000 transaction(s)] done [0.08s].
## sorting and recoding items ... [1004 item(s)] done [0.01s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 done [0.06s].
## writing ... [8944 rule(s)] done [0.00s].
## creating S4 object  ... done [0.01s].

The algorithm found 8,944 association rules.

fm_rules
## set of 8944 rules

We can also see summary statistics of the association rules.

summary(fm_rules)
## set of 8944 rules
## 
## rule length distribution (lhs + rhs):sizes
##    2    3    4 
## 4517 4271  156 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   2.000   2.512   3.000   4.000 
## 
## summary of quality measures:
##     support           confidence        coverage            lift       
##  Min.   :0.005000   Min.   :0.1500   Min.   :0.00640   Min.   : 0.845  
##  1st Qu.:0.005533   1st Qu.:0.2049   1st Qu.:0.01740   1st Qu.: 2.503  
##  Median :0.006467   Median :0.2759   Median :0.02493   Median : 3.402  
##  Mean   :0.007892   Mean   :0.3050   Mean   :0.03036   Mean   : 4.239  
##  3rd Qu.:0.008600   3rd Qu.:0.3807   3rd Qu.:0.03493   3rd Qu.: 5.032  
##  Max.   :0.058200   Max.   :0.7917   Max.   :0.18027   Max.   :45.740  
##      count      
##  Min.   : 75.0  
##  1st Qu.: 83.0  
##  Median : 97.0  
##  Mean   :118.4  
##  3rd Qu.:129.0  
##  Max.   :873.0  
## 
## mining info:
##     data ntransactions support confidence
##  artists         15000   0.005       0.15

As we can see, there are 4,517 rules with 2 items, 4,271 rules with 3 items and 156 rules with 4 items.

Confidence, however, is not always the best statistic for finding an association between two items. If item Y has very high support, a rule X => Y can reach high confidence even when X and Y are unrelated, especially when X is rare. If, for example, caviar was bought only a couple of times in a shop and usually together with bread, the rule {caviar} => {bread} will have a very high confidence simply because bread is the most frequently bought product. That is usually not what we are interested in. Lift is a measure that also takes into account the support (frequency) of item Y (bread in the example): it divides the rule's confidence by the support of Y. A lift higher than 1 indicates a positive association, a value close to 1 indicates that the itemsets are independent, and a value lower than 1 indicates a negative association between the itemsets. For the association rules we obtained, the average lift is 4.2, which seems quite high. If we set a lower support or confidence threshold for the Apriori algorithm, we would probably get more association rules, but with lower lift values. The right choice of minimum support and confidence depends on the aim of the analysis.

Coverage is a measure similar to support, but it is computed only for the left-hand side (LHS) of a rule, i.e. it is the support of the LHS itemset. The arules package documentation says: "It represents a measure of how often the rule can be applied."
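
The caviar and bread example can be put into numbers. The counts below are invented purely for illustration; they show a rule with high confidence whose lift is nevertheless close to 1.

```r
# Invented shop counts (assumptions for illustration only)
n_transactions <- 1000
n_bread        <- 800   # bread is in most baskets
n_caviar       <- 10    # caviar is rare
n_both         <- 9     # caviar almost always comes with bread

support_bread <- n_bread / n_transactions

# confidence({caviar} => {bread}) = count(both) / count(caviar)
confidence <- n_both / n_caviar            # 0.9

# lift = confidence / support of the RHS item
lift <- confidence / support_bread         # 1.125

print(c(confidence = confidence, lift = lift))
```

A confidence of 0.9 looks impressive, but bread is in 80% of all baskets anyway, so buying caviar raises the chance of buying bread only slightly.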

Let's now see what the results of the Apriori algorithm look like. Here are the first ten rules.

##      lhs                   rhs                support     confidence
## [1]  {ferry corsten}    => {armin van buuren} 0.005333333 0.6201550 
## [2]  {armin van buuren} => {ferry corsten}    0.005333333 0.3162055 
## [3]  {after forever}    => {nightwish}        0.005266667 0.5766423 
## [4]  {gamma ray}        => {helloween}        0.005400000 0.5827338 
## [5]  {helloween}        => {gamma ray}        0.005400000 0.2988930 
## [6]  {talib kweli}      => {kanye west}       0.005000000 0.4777070 
## [7]  {anthrax}          => {metallica}        0.005333333 0.5263158 
## [8]  {paul van dyk}     => {armin van buuren} 0.005400000 0.4628571 
## [9]  {armin van buuren} => {paul van dyk}     0.005400000 0.3201581 
## [10] {paul van dyk}     => {atb}              0.005200000 0.4457143 
##      coverage    lift      count
## [1]  0.008600000 36.768085 80   
## [2]  0.016866667 36.768085 80   
## [3]  0.009133333  9.310694 79   
## [4]  0.009266667 32.254639 81   
## [5]  0.018066667 32.254639 81   
## [6]  0.010466667  7.456405 75   
## [7]  0.010133333  4.727387 80   
## [8]  0.011666667 27.442123 81   
## [9]  0.016866667 27.442123 81   
## [10] 0.011666667 22.065064 78

We can order the rules by various statistics. Let's start with lift. The rules with the highest lift concern Madvillain and MF DOOM. Their confidence is also fairly high (about 0.5), so we can conclude that this is a strong rule: people who listen to one are likely to listen to the other as well. These artists weren't very popular among Last.FM users, however. The support for those rules is only 0.0053, slightly above the threshold we set for the Apriori algorithm.

inspect(sort(fm_rules, by = "lift")[1:20])
##      lhs                    rhs                  support     confidence
## [1]  {madvillain}        => {mf doom}            0.005333333 0.5031447 
## [2]  {mf doom}           => {madvillain}         0.005333333 0.4848485 
## [3]  {ferry corsten}     => {armin van buuren}   0.005333333 0.6201550 
## [4]  {armin van buuren}  => {ferry corsten}      0.005333333 0.3162055 
## [5]  {ben folds five}    => {ben folds}          0.005000000 0.5639098 
## [6]  {ben folds}         => {ben folds five}     0.005000000 0.3024194 
## [7]  {gamma ray}         => {helloween}          0.005400000 0.5827338 
## [8]  {helloween}         => {gamma ray}          0.005400000 0.2988930 
## [9]  {paul van dyk}      => {armin van buuren}   0.005400000 0.4628571 
## [10] {armin van buuren}  => {paul van dyk}       0.005400000 0.3201581 
## [11] {usher}             => {ne-yo}              0.005066667 0.4175824 
## [12] {ne-yo}             => {usher}              0.005066667 0.3261803 
## [13] {chris brown}       => {ne-yo}              0.005733333 0.3891403 
## [14] {ne-yo}             => {chris brown}        0.005733333 0.3690987 
## [15] {ludacris}          => {t.i.}               0.005333333 0.4444444 
## [16] {t.i.}              => {ludacris}           0.005333333 0.2909091 
## [17] {lady gaga,rihanna} => {the pussycat dolls} 0.005133333 0.4254144 
## [18] {the game}          => {t.i.}               0.006000000 0.4326923 
## [19] {t.i.}              => {the game}           0.006000000 0.3272727 
## [20] {beyoncé,rihanna}   => {the pussycat dolls} 0.005866667 0.4210526 
##      coverage    lift     count
## [1]  0.010600000 45.74042 80   
## [2]  0.011000000 45.74042 80   
## [3]  0.008600000 36.76809 80   
## [4]  0.016866667 36.76809 80   
## [5]  0.008866667 34.10745 75   
## [6]  0.016533333 34.10745 75   
## [7]  0.009266667 32.25464 81   
## [8]  0.018066667 32.25464 81   
## [9]  0.011666667 27.44212 81   
## [10] 0.016866667 27.44212 81   
## [11] 0.012133333 26.88299 76   
## [12] 0.015533333 26.88299 76   
## [13] 0.014733333 25.05195 86   
## [14] 0.015533333 25.05195 86   
## [15] 0.012000000 24.24242 80   
## [16] 0.018333333 24.24242 80   
## [17] 0.012066667 23.63413 77   
## [18] 0.013866667 23.60140 90   
## [19] 0.018333333 23.60140 90   
## [20] 0.013933333 23.39181 88

Here are the rules with the highest support. As we can see, they include some of the best-known bands in the world, such as The Beatles, Radiohead and Coldplay.

inspect(sort(fm_rules, by = "support")[1:20])
##      lhs                        rhs                     support    confidence
## [1]  {the beatles}           => {radiohead}             0.05820000 0.3272114 
## [2]  {radiohead}             => {the beatles}           0.05820000 0.3228550 
## [3]  {coldplay}              => {radiohead}             0.05460000 0.3444071 
## [4]  {radiohead}             => {coldplay}              0.05460000 0.3028846 
## [5]  {coldplay}              => {the beatles}           0.04433333 0.2796468 
## [6]  {the beatles}           => {coldplay}              0.04433333 0.2492504 
## [7]  {muse}                  => {radiohead}             0.04300000 0.3769725 
## [8]  {radiohead}             => {muse}                  0.04300000 0.2385355 
## [9]  {the killers}           => {coldplay}              0.04106667 0.4181942 
## [10] {coldplay}              => {the killers}           0.04106667 0.2590412 
## [11] {pink floyd}            => {the beatles}           0.03966667 0.3780178 
## [12] {the beatles}           => {pink floyd}            0.03966667 0.2230135 
## [13] {muse}                  => {coldplay}              0.03880000 0.3401520 
## [14] {coldplay}              => {muse}                  0.03880000 0.2447435 
## [15] {red hot chili peppers} => {coldplay}              0.03860000 0.3241881 
## [16] {coldplay}              => {red hot chili peppers} 0.03860000 0.2434819 
## [17] {bob dylan}             => {the beatles}           0.03446667 0.4971154 
## [18] {the beatles}           => {bob dylan}             0.03446667 0.1937781 
## [19] {pink floyd}            => {radiohead}             0.03426667 0.3265565 
## [20] {radiohead}             => {pink floyd}            0.03426667 0.1900888 
##      coverage   lift     count
## [1]  0.17786667 1.815152 873  
## [2]  0.18026667 1.815152 873  
## [3]  0.15853333 1.910542 819  
## [4]  0.18026667 1.910542 819  
## [5]  0.15853333 1.572227 665  
## [6]  0.17786667 1.572227 665  
## [7]  0.11406667 2.091194 645  
## [8]  0.18026667 2.091194 645  
## [9]  0.09820000 2.637894 616  
## [10] 0.15853333 2.637894 616  
## [11] 0.10493333 2.125287 595  
## [12] 0.17786667 2.125287 595  
## [13] 0.11406667 2.145618 582  
## [14] 0.15853333 2.145618 582  
## [15] 0.11906667 2.044921 579  
## [16] 0.15853333 2.044921 579  
## [17] 0.06933333 2.794877 517  
## [18] 0.17786667 2.794877 517  
## [19] 0.10493333 1.811519 514  
## [20] 0.18026667 1.811519 514
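
The quality measures in these listings are tied together by simple arithmetic. As a sanity check, the sketch below recomputes them for the rule {the beatles} => {radiohead} from the values printed above; the support of Radiohead is read from the coverage of rule [2], whose LHS is {radiohead}. Small differences are expected because the printed values are rounded.

```r
# Values copied from the output for {the beatles} => {radiohead}
n_users    <- 15000
support    <- 0.05820000
confidence <- 0.3272114
support_radiohead <- 0.18026667  # coverage of rule [2], i.e. support of {radiohead}

# count = support * number of transactions
count <- support * n_users

# coverage = support of the LHS = support / confidence
coverage <- support / confidence

# lift = confidence / support of the RHS
lift <- confidence / support_radiohead

print(c(count = count, coverage = coverage, lift = lift))
```

The results agree with the reported count of 873, coverage of 0.17786667 and lift of 1.815152 up to rounding.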

Now the association rules ordered by confidence. The highest confidence corresponds to Keane paired with Travis, U2, Snow Patrol or Oasis on the LHS and Coldplay on the RHS. In roughly 76% to 79% of cases, someone who listened to both artists in the LHS also listened to Coldplay.

inspect(sort(fm_rules, by = "confidence")[1:20])
##      lhs                        rhs                support confidence    coverage      lift count
## [1]  {keane,                                                                                     
##       travis}                => {coldplay}     0.005066667  0.7916667 0.006400000  4.993692    76
## [2]  {keane,                                                                                     
##       u2}                    => {coldplay}     0.005800000  0.7767857 0.007466667  4.899826    87
## [3]  {keane,                                                                                     
##       snow patrol}           => {coldplay}     0.007866667  0.7564103 0.010400000  4.771301   118
## [4]  {keane,                                                                                     
##       oasis}                 => {coldplay}     0.006800000  0.7555556 0.009000000  4.765910   102
## [5]  {jay-z,                                                                                     
##       t.i.}                  => {kanye west}   0.005066667  0.7524752 0.006733333 11.745191    76
## [6]  {beck,                                                                                      
##       the smashing pumpkins} => {radiohead}    0.006466667  0.7519380 0.008600000  4.171254    97
## [7]  {oasis,                                                                                     
##       travis}                => {coldplay}     0.005400000  0.7431193 0.007266667  4.687464    81
## [8]  {broken social scene,                                                                       
##       the beatles}           => {radiohead}    0.006000000  0.7317073 0.008200000  4.059027    90
## [9]  {arctic monkeys,                                                                            
##       keane}                 => {coldplay}     0.005266667  0.7247706 0.007266667  4.571724    79
## [10] {oasis,                                                                                     
##       radiohead,                                                                                 
##       the killers}           => {coldplay}     0.005133333  0.7196262 0.007133333  4.539274    77
## [11] {muse,                                                                                      
##       oasis,                                                                                     
##       the killers}           => {coldplay}     0.005200000  0.7155963 0.007266667  4.513854    78
## [12] {björk,                                                                                     
##       the smashing pumpkins} => {radiohead}    0.005800000  0.7131148 0.008133333  3.955888    87
## [13] {air,                                                                                       
##       the smashing pumpkins} => {radiohead}    0.005733333  0.7107438 0.008066667  3.942736    86
## [14] {beyoncé,                                                                                   
##       the pussycat dolls}    => {rihanna}      0.005866667  0.7040000 0.008333333 16.346749    88
## [15] {jimi hendrix,                                                                              
##       pink floyd,                                                                                
##       the beatles}           => {led zeppelin} 0.005066667  0.7037037 0.007200000  8.885148    76
## [16] {placebo,                                                                                   
##       radiohead,                                                                                 
##       the killers}           => {muse}         0.005066667  0.7037037 0.007200000  6.169232    76
## [17] {muse,                                                                                      
##       travis}                => {coldplay}     0.005133333  0.7000000 0.007333333  4.415475    77
## [18] {broken social scene,                                                                       
##       the shins}             => {radiohead}    0.005733333  0.6991870 0.008200000  3.878626    86
## [19] {death cab for cutie,                                                                       
##       radiohead,                                                                                 
##       the killers}           => {coldplay}     0.005400000  0.6982759 0.007733333  4.404600    81
## [20] {sigur rós,                                                                                 
##       the cure}              => {radiohead}    0.007066667  0.6973684 0.010133333  3.868538   106

It is also possible to see which artists lead to a given band, in other words which artists appear on the LHS of rules with that band on the RHS. Here is an example for the French duo Daft Punk. Listeners of Justice and The Chemical Brothers are the most likely to also listen to Daft Punk.

rules.rootveg <- apriori(data = artists, parameter = list(supp = 0.01, conf = 0.005),
                         appearance = list(default = "lhs", rhs = "daft punk"), control = list(verbose = FALSE))
rules.rootveg.byconf <- sort(rules.rootveg, by = "confidence", decreasing = TRUE)
inspect(head(rules.rootveg.byconf))
##     lhs                        rhs         support    confidence coverage  
## [1] {justice}               => {daft punk} 0.01453333 0.4266145  0.03406667
## [2] {the chemical brothers} => {daft punk} 0.01480000 0.3333333  0.04440000
## [3] {röyksopp}              => {daft punk} 0.01246667 0.2729927  0.04566667
## [4] {gorillaz}              => {daft punk} 0.01300000 0.2302243  0.05646667
## [5] {air}                   => {daft punk} 0.01466667 0.2107280  0.06960000
## [6] {moby}                  => {daft punk} 0.01140000 0.1879121  0.06066667
##     lift     count
## [1] 5.613348 218  
## [2] 4.385965 222  
## [3] 3.592009 187  
## [4] 3.029267 195  
## [5] 2.772736 220  
## [6] 2.472527 171

Out of interest we can see also examples for some other popular musicians.

Franz Ferdinand:

##     lhs                       rhs               support    confidence
## [1] {kaiser chiefs}        => {franz ferdinand} 0.01320000 0.4221748 
## [2] {arctic monkeys}       => {franz ferdinand} 0.01966667 0.2606007 
## [3] {coldplay,the killers} => {franz ferdinand} 0.01060000 0.2581169 
## [4] {the strokes}          => {franz ferdinand} 0.01380000 0.2555556 
## [5] {the kooks}            => {franz ferdinand} 0.01120000 0.2317241 
## [6] {the killers}          => {franz ferdinand} 0.02126667 0.2165648 
##     coverage   lift     count
## [1] 0.03126667 7.115306 198  
## [2] 0.07546667 4.392147 295  
## [3] 0.04106667 4.350285 159  
## [4] 0.05400000 4.307116 207  
## [5] 0.04833333 3.905463 168  
## [6] 0.09820000 3.649969 319

Paramore:

##     lhs               rhs        support    confidence coverage   lift    
## [1] {fall out boy} => {paramore} 0.01246667 0.24129032 0.05166667 5.299202
## [2] {blink-182}    => {paramore} 0.01026667 0.17824074 0.05760000 3.914511
## [3] {linkin park}  => {paramore} 0.01266667 0.12898846 0.09820000 2.832836
## [4] {muse}         => {paramore} 0.01000000 0.08766803 0.11406667 1.925359
## [5] {coldplay}     => {paramore} 0.01120000 0.07064760 0.15853333 1.551558
## [6] {}             => {paramore} 0.04553333 0.04553333 1.00000000 1.000000
##     count
## [1] 187  
## [2] 154  
## [3] 190  
## [4] 150  
## [5] 168  
## [6] 683

The Clash (a rule with an empty LHS means that, regardless of which other items are present, the item in the RHS appears with the probability given by the rule's confidence, which here equals the item's overall support [taken from arules package documentation]):

##     lhs              rhs         support    confidence coverage   lift    
## [1] {ramones}     => {the clash} 0.01046667 0.25864909 0.04046667 5.905230
## [2] {bob dylan}   => {the clash} 0.01073333 0.15480769 0.06933333 3.534422
## [3] {the beatles} => {the clash} 0.01560000 0.08770615 0.17786667 2.002423
## [4] {radiohead}   => {the clash} 0.01206667 0.06693787 0.18026667 1.528262
## [5] {}            => {the clash} 0.04380000 0.04380000 1.00000000 1.000000
##     count
## [1] 157  
## [2] 161  
## [3] 234  
## [4] 181  
## [5] 657

We can also check the opposite: which other bands were enjoyed by people who listened to a certain musician. Here is an example for the German band Rammstein.

rules.rootvegopp <- apriori(data = artists, parameter = list(supp = 0.01, conf = 0.005),
                            appearance = list(default = "rhs", lhs = "rammstein"), control = list(verbose = FALSE))
rules.rootvegopp.byconf <- sort(rules.rootvegopp, by = "confidence", decreasing = TRUE)
inspect(head(rules.rootvegopp.byconf))
##     lhs            rhs                support    confidence coverage   lift    
## [1] {rammstein} => {system of a down} 0.02333333 0.3330162  0.07006667 3.659518
## [2] {rammstein} => {metallica}        0.02140000 0.3054234  0.07006667 2.743324
## [3] {rammstein} => {nightwish}        0.01693333 0.2416746  0.07006667 3.902173
## [4] {rammstein} => {linkin park}      0.01526667 0.2178877  0.07006667 2.218816
## [5] {rammstein} => {marilyn manson}   0.01453333 0.2074215  0.07006667 4.256255
## [6] {rammstein} => {ko<U+042F>n}      0.01446667 0.2064700  0.07006667 4.190867
##     count
## [1] 350  
## [2] 321  
## [3] 254  
## [4] 229  
## [5] 218  
## [6] 217

With the help of the arulesViz package we can also visualize the relationship between support and confidence for the obtained association rules. As we can see, rules with high support often have lower confidence than rules with low support. In fact, confidence is highest for rules with very low support. This agrees with our previous conclusions.

library(arulesViz)
plot(fm_rules)

Now let's look at the relationship between support and lift. The highest lift values belong to rules with low support, while rules with high support have low lift.

plot(fm_rules, measure=c("support","lift"), shading="confidence")

The same plot as before (support against confidence), this time shaded by the number of items in a rule.

plot(fm_rules, shading="order", control=list(main="Two-key plot"))

It is also possible to plot the association rules with the highest support as a graph. For better readability I limited the plot to the 20 rules with the highest support.

plot(sort(fm_rules, by = "support")[1:20], method="graph")

Summary

Based on the analysis, we can conclude that the obtained results are quite satisfactory. We got to know more about the most popular musicians on the Last.FM platform and about the associations between them. Such knowledge could be very useful for providing recommendations to users or for personalised advertising.