Context

Last.fm is a website that provides “audioscrobbler” software, which lets users keep track of what they’ve listened to in terms of artist listen counts. It also lets them “tag” artists with genres of their own choosing and creation.

Data

The data is available from the GroupLens project here: https://grouplens.org/datasets/hetrec-2011/.

The dataset includes information from 1,892 users, split across 4 main files.

Artists: A listing of 17,632 unique artists, with unique identifiers.
Tags: A listing of 11,946 user-defined “tags”, i.e. musical genres.
User_artists: A listing of 92,834 user-artist relations, defined by listen count.
User_tagged_artists: A listing of 186,479 artist tag assignments by user.

library(dplyr)
library(tidyr)
library(lsa)

user_tags <- read.delim("data/user_taggedartists.dat", header = TRUE)
user_artists <- read.delim("data/user_artists.dat", header = TRUE)
tags <- read.delim("data/tags.dat", header = TRUE)

Feature Extraction

The main benefit of using tags was that the features were already defined by users. The drawback was that, because tags are user-defined, many were used by only one user. For example, tag #4550 was “worst song ever from miley”, which is unlikely to be used often enough to serve as a prominent feature.

To deal with this problem, I counted how often each tag appeared in the user_taggedartists data and kept the 100 most commonly used tags across all users.

#extract the 100 most frequently used tags
#the 100th most used tag was applied 315 times
top_tags <- user_tags %>%
              group_by(tagID) %>%
              summarise(tag_count = n()) %>%
              arrange(desc(tag_count)) %>%
              slice(1:100)

The most used tag (73 = rock) was applied 7,503 times, while the 100th most used tag (839 = sad) was applied 315 times. I used these 100 tags as the dimensions of the vectors that define item and user profiles. Data loss was acceptable: about 61% of user-artist-tag relations (113,247 of 186,479) were preserved.
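To make the trade-off concrete, here is a minimal base R sketch of how the share of preserved tag assignments can be computed for a given cutoff. The `toy_tags` data frame and the cutoff `top_n_tags` are made up for illustration; they are not the Last.fm data, and this is not the exact code used above.

```r
# Toy tag assignments (hypothetical data, not the Last.fm files)
toy_tags <- data.frame(
  userID   = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5),
  artistID = c(10, 11, 10, 12, 10, 11, 12, 13, 10, 13),
  tagID    = c(73, 73, 73, 24, 24, 73, 4550, 73, 24, 9001)
)

# Count how often each tag is used, most frequent first
tag_counts <- sort(table(toy_tags$tagID), decreasing = TRUE)

# Keep the top N tags (the post uses N = 100; N = 2 suits the toy data)
top_n_tags <- 2
kept_tags <- as.numeric(names(tag_counts)[seq_len(top_n_tags)])

# Share of all tag assignments that use one of the kept tags
coverage <- mean(toy_tags$tagID %in% kept_tags)
coverage
```

Rare one-off tags such as 4550 and 9001 in the toy data fall outside the cutoff, which is the intended effect.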

Item Profile

Item profiles are defined in terms of the top 100 tags. More popular artists had many more plays and tags than others. To control for this, I normalized by dividing each artist’s total count for a given tag by the number of unique users who tagged that artist.

In effect, this asks: “of all users who tagged artist X, what fraction applied tag Y?” For instance, 66% of users who tagged David Bowie tagged him with “Classic Rock”.
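As a small worked illustration of this normalization, the sketch below computes the tag shares for a single artist in base R. The `toy` data frame is invented for the example. Reading the result as a fraction of users assumes each user applies a given tag to an artist at most once.

```r
# Toy tag assignments for one artist (hypothetical data)
# Each row is one user applying one tag to the artist
toy <- data.frame(
  userID = c(1, 1, 2, 3, 3),
  tag    = c("classic rock", "glam", "classic rock", "glam", "80s")
)

# Number of assignments per tag for this artist
tag_counts <- table(toy$tag)

# Number of distinct users who tagged this artist at all
n_users <- length(unique(toy$userID))

# Share of tagging users who applied each tag
tag_share <- tag_counts / n_users
tag_share
```

Here two of the three users applied “classic rock”, so that dimension of the toy profile is 2/3.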

#pare the artist-tag list down to the top 100 tags
#60.7% of data remains (113247/186479 tags)
artist_tags <- inner_join(user_tags, top_tags, by="tagID")

#item (artist)
artist_pro_data <- artist_tags %>% 
                    select(artistID, tagID) %>%
                    group_by(artistID, tagID) %>%
                    summarise( count = n()) %>%
                    arrange(desc(count)) %>%
                    as.data.frame()

#spread out the data by having each tag (genre) be a separate column                     
artist_matrix <- spread(artist_pro_data, tagID, count)


#number of distinct users who tagged each artist
distinct_artist_user_count <- artist_tags %>%
                                group_by(artistID) %>%
                                summarise(unique_users = n_distinct(userID)) %>%
                                arrange(desc(unique_users))

#attach the distinct user count to each artist row, so each row can later be divided by it

artist_matrix <- inner_join(artist_matrix,distinct_artist_user_count)
## Joining, by = "artistID"

Item Profile Similarity

With the artist vectors built up, I was now in a position to compare artist profiles. To accomplish this, I used cosine similarity, which calculates the cosine of the angle between two vectors.

Here is a graphical representation in 2 dimensions, where theta is the angle between vectors a and b and the similarity is the cosine of theta.

The same idea applies to vectors in 100 dimensions. Another benefit of cosine similarity is that the scale, or magnitude, of the vectors is irrelevant; only their direction matters. I therefore don’t need to worry about rescaling the vectors before comparing them.
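For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below writes this out in base R and checks the scale-invariance claim. The helper name `cosine_sim` is made up for this illustration; the analysis itself uses `cosine()` from the lsa package.

```r
# Cosine similarity of two numeric vectors, written out in base R
# (cosine_sim is a hypothetical helper, not part of the lsa package)
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

a <- c(3, 1, 0)
b <- c(1, 2, 0)

base_sim <- cosine_sim(a, b)

# Rescaling either vector leaves the similarity unchanged
scaled_a_sim <- cosine_sim(10 * a, b)
scaled_b_sim <- cosine_sim(a, 0.5 * b)

c(base_sim, scaled_a_sim, scaled_b_sim)
```

Note that an all-zero vector has no direction, so this formula returns NaN for it.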

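#normalize: divide each artist's tag counts (columns 2:101) by its distinct user count (column 102)
norm_artist_matrix <- artist_matrix

for (i in seq_len(nrow(artist_matrix))) {
  norm_artist_matrix[i,2:101] <- artist_matrix[i,2:101]/artist_matrix[i,102]
}

#tags never applied to an artist are NA after spread(); treat them as 0
norm_artist_matrix0 <- norm_artist_matrix

norm_artist_matrix0[is.na(norm_artist_matrix0)] <- 0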
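The row-by-row loop can be slow on a matrix with many thousands of artists. As a possible alternative, the sketch below does the same division in one step with base R's `sweep()`, on a toy data frame in the same layout (ID column, tag columns, user-count column). The names `toy_matrix`, `tag_cols` and `norm_toy` are assumptions for the example.

```r
# Toy matrix in the same layout: ID column, tag-count columns, user-count column
# (hypothetical values; NA means the tag was never applied to that artist)
toy_matrix <- data.frame(
  artistID     = c(1, 2, 3),
  t73          = c(4, NA, 1),
  t24          = c(2, 3, NA),
  unique_users = c(4, 3, 2)
)
tag_cols <- c("t73", "t24")

# Divide every tag column by the row's user count in one step
norm_toy <- toy_matrix
norm_toy[tag_cols] <- as.data.frame(
  sweep(as.matrix(toy_matrix[tag_cols]), 1, toy_matrix$unique_users, "/")
)

# Replace the remaining NAs with 0, as in the loop version
norm_toy[is.na(norm_toy)] <- 0
norm_toy
```

Selecting the tag columns by name rather than by position (2:101) also avoids relying on there being exactly 100 tag columns.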

Example

We will use Madonna as an example. Here is the artist page for Madonna from Last.fm (from https://www.last.fm/music/Madonna):

Now I will pull the profile vectors for Madonna and for the artists I want to compare her with.

madonna <- as.numeric(as.vector(norm_artist_matrix0[62,2:101]))
brittany_spears <- as.numeric(as.vector(norm_artist_matrix0[289,2:101]))
cher <- as.numeric(as.vector(norm_artist_matrix0[336,2:101]))
kylie_minogue <- as.numeric(as.vector(norm_artist_matrix0[51,2:101]))
depeche_mode <- as.numeric(as.vector(norm_artist_matrix0[67,2:101]))
lady_gaga <- as.numeric(as.vector(norm_artist_matrix0[84,2:101]))
radiohead <- as.numeric(as.vector(norm_artist_matrix0[124,2:101]))
talking_heads <- as.numeric(as.vector(norm_artist_matrix0[521,2:101]))
david_bowie <- as.numeric(as.vector(norm_artist_matrix0[599,2:101]))
wu_tang <- as.numeric(as.vector(norm_artist_matrix0[3197,2:101]))

Two vectors pointing in the same direction have a cosine similarity of 1, so the closer the value is to 1, the more similar the artists.

cosine(madonna, kylie_minogue)
##           [,1]
## [1,] 0.9573262
cosine(madonna, brittany_spears)
##           [,1]
## [1,] 0.9458821
cosine(madonna,cher)
##           [,1]
## [1,] 0.7205063

My method matches Last.fm’s suggested artists well: the first two artists are rated as extremely similar, and Cher as somewhat less (but still quite) similar.

To look at some artists that should be less similar, I compared Madonna against Radiohead and Wu-Tang Clan.

cosine(madonna,radiohead)
##           [,1]
## [1,] 0.1502968
cosine(madonna,wu_tang)
##      [,1]
## [1,]    0

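Neither was very similar. Radiohead scored about 0.15, and Wu-Tang Clan scored 0, which means its profile shares none of the top 100 tags with Madonna’s.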
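The pairwise checks above extend naturally to ranking every artist against one target. The following base R sketch shows one way to do that on a toy profile matrix. The artist names, the profile values and the helper `similar_to` are all invented for illustration and are not part of the original analysis.

```r
# Toy normalized profiles: one row per artist, one column per tag
# (hypothetical values)
profiles <- rbind(
  pop_a  = c(0.9, 0.1, 0.0),
  pop_b  = c(0.8, 0.2, 0.0),
  rock_a = c(0.1, 0.9, 0.0),
  rap_a  = c(0.0, 0.0, 1.0)
)

# Cosine similarity between one target row and every other row,
# sorted from most to least similar
similar_to <- function(mat, target) {
  v <- mat[target, ]
  sims <- as.vector(mat %*% v) / (sqrt(rowSums(mat^2)) * sqrt(sum(v^2)))
  names(sims) <- rownames(mat)
  sort(sims[names(sims) != target], decreasing = TRUE)
}

ranking <- similar_to(profiles, "pop_a")
ranking
```

With non-negative profiles like these, a similarity of 0 can only occur when two artists share no tags at all, as with `rap_a` here.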