1 Project Introduction

Music streaming service companies such as Spotify and Pandora have built recommendaion systems to recommend songs and artists to users based upon user infomration. These systems mainly leverage user based collaborative filtering and content based models to implement the system. They continiusly learn fron the users behvior, interest and listening pattern.

The goal of this project is to implement a musical artist recommendations. Below are the details of the system implementaion:

Build a recommendation model using user-based collaborative filtering algorithm or UBCF
Recommend Similar Artists via the Artist-Genre Matrix
Recommend Artists for a Given User

2 Project Data - HetRec 2011

The datasets were generated by the Information Retrieval Group at Universidad Autónoma de Madrid (http://ir.ii.uam.es). Before using these datasets, please review the README files for the usage license and other details.

link: https://grouplens.org/datasets/hetrec-2011/

This dataset contains social networking, tagging, and music artist listening information from a set of 2K users from Last.fm online music system. http://www.last.fm

Data statistics

1892 users
17632 artists
12717 bi-directional user friend relations, i.e. 25434 (user_i, user_j) pairs
- avg. 13.443 friend relations per user
92834 user-listened artist relations, i.e. tuples [user, artist, listeningCount]
- avg. 49.067 artists most listened by each user
- avg. 5.265 users who listened each artist
11946 tags
186479 tag assignments (tas), i.e. tuples [user, tag, artist]
- avg. 98.562 tas per user
- avg. 14.891 tas per artist
- avg. 18.930 distinct tags used by each user
- avg. 8.764 distinct tags used for each artist

3 Data Loading

artists.dat file contains information about music artists listened and tagged by the users.

# load list of last.fm artists: 
artists <- read_delim("https://raw.githubusercontent.com/niteen11/MSDS/master/DATA643/dataset/hetrec2011-lastfm-2k/artists.dat", delim = "\t") %>% select(id, name)
dim(artists)

## [1] 16423     2

kable(head(artists,10))

id	name
1	MALICE MIZER
2	Diary of Dreams
3	Carpathian Forest
4	Moi dix Mois
5	Bella Morte
6	Moonspell
7	Marilyn Manson
8	DIR EN GREY
9	Combichrist
10	Grendel

user_artists.dat file contains the artists listened by each user. It also provides a listening count for each [user, artist] pair.

# loading last.fm user_artists file
ratings <- read.table("https://raw.githubusercontent.com/niteen11/MSDS/master/DATA643/dataset/hetrec2011-lastfm-2k/user_artists.dat", header = TRUE, sep = "", stringsAsFactors = FALSE)
dim(ratings)

## [1] 92834     3

kable(head(ratings,10))

userID	artistID	weight
2	51	13883
2	52	11690
2	53	11351
2	54	10300
2	55	8983
2	56	6152
2	57	5955
2	58	4616
2	59	4337
2	60	4147

4 Exploratory Data Analysis and Visualization

In the below table TotalUsers column indicates the total number of last.fm users that have listened to the artist represented by the last.fm artist ID.

users_artists.stat <- ratings %>%
          group_by(artistID) %>%
          summarise(TotalUsers=length(unique(userID))) %>%
          arrange(desc(TotalUsers))

kable(head(users_artists.stat,10))

artistID	TotalUsers
89	611
289	522
288	484
227	480
300	473
67	429
333	417
292	407
190	400
498	399

library(ggplot2)
ggplot(data=head(users_artists.stat,100), aes(x=artistID, y=TotalUsers)) +
  geom_bar(stat="identity") +
  theme_bw()

Let’s build a user-artist-rating dataframe with artist’s name populated

users.artist.rating <- merge.data.frame(artists,ratings,by.x = 'id',by.y = 'artistID')
kable(head(users.artist.rating))

id	name	userID	weight
1	MALICE MIZER	34	212
1	MALICE MIZER	274	483
1	MALICE MIZER	785	76
2	Diary of Dreams	2066	83
2	Diary of Dreams	135	1021
2	Diary of Dreams	1604	455

Below dataframe shows artists names that are listened most by the users.

users.artist.info<- merge.data.frame(artists,users_artists.stat,by.x = 'id',by.y = 'artistID') %>%
                    arrange(desc(TotalUsers))
kable(head(users.artist.info,20))

id	name	TotalUsers
89	Lady Gaga	611
289	Britney Spears	522
288	Rihanna	484
227	The Beatles	480
300	Katy Perry	473
67	Madonna	429
333	Avril Lavigne	417
292	Christina Aguilera	407
190	Muse	400
498	Paramore	399
295	Beyoncé	397
154	Radiohead	393
65	Coldplay	369
466	Ke$ha	362
701	Shakira	319
302	P!nk	305
229	The Killers	304
306	Black Eyed Peas	304
55	Kylie Minogue	298
461	Miley Cyrus	286

Let’s focus on top 1000 artists as determined by the number of listeners/ users.

artists.top1000  = users_artists.stat[1:1000,]
kable(head(artists.top1000))

artistID	TotalUsers
89	611
289	522
288	484
227	480
300	473
67	429

missing <- subset(artists.top1000, !(artistID %in% users.artist.info$id))
length(missing$artistID)

## [1] 182

Since 182 artists must be removed, this leaves us with a total of 818 artists to be retained within our user-artists data.

lastfm.ua <- subset(ratings, artistID %in% artists.top1000$artistID)
lastfm.ua <- subset(ratings, !(artistID %in% missing$artistID))
top_818 <- subset(artists.top1000, !(artistID %in% missing$artistID))
rm(artists.top1000)
kable(head(top_818))

artistID	TotalUsers
89	611
289	522
288	484
227	480
300	473
67	429

tags.dat file contains the set of tags available in the dataset. Load list of genres / tags:

lastfm.tags <- read_delim("https://raw.githubusercontent.com/niteen11/MSDS/master/DATA643/dataset/hetrec2011-lastfm-2k/tags.dat", delim = "\t")
dim(lastfm.tags)

## [1] 11946     2

user_taggedartists.dat file contains the tag assignments of artists provided by each particular user. Load list of UserID / Tag pairs:

users.tags <- read_delim("https://raw.githubusercontent.com/niteen11/MSDS/master/DATA643/dataset/hetrec2011-lastfm-2k/user_taggedartists.dat", delim = "\t") %>% select(userID, artistID, tagID)
dim(lastfm.tags)

## [1] 11946     2

kable(head(users.tags))

userID	artistID	tagID
2	52	13
2	52	15
2	52	18
2	52	21
2	52	41
2	63	13

users_artists.tag.stat <- users.tags %>%
          group_by(tagID) %>%
          summarise(TotalUsers=length(unique(userID))) %>%
          arrange(desc(TotalUsers))

kable(head(users_artists.tag.stat))

tagID	TotalUsers
73	673
24	585
79	532
18	503
81	450
130	440

Considering Top 200 tags only for analysis purpose

tag_200 <- users_artists.tag.stat[1:200,]
tag_200 <- arrange(tag_200, tagID)
users.artist.tag.info<- merge.data.frame(lastfm.tags,users_artists.tag.stat,by.x = 'tagID',by.y = 'tagID') %>%
                    arrange(desc(TotalUsers))
kable(head(users.artist.tag.info,20))

tagID	tagValue	TotalUsers
73	rock	673
24	pop	585
79	alternative	532
18	electronic	503
81	indie	450
130	female vocalists	440
78	alternative rock	374
39	dance	370
84	indie rock	262
195	british	254
192	classic rock	252
25	80s	248
33	experimental	247
181	punk	243
14	ambient	242
1	metal	237
72	hard rock	235
134	singer-songwriter	221
292	folk	211
191	instrumental	195

Now, let’s exclude artists not in top 818 list from the above artists analysis

user_top_tags<- subset(users.tags, tagID %in% tag_200$tagID)
user_top_tags<- subset(user_top_tags, artistID %in% top_818$artistID)
length(unique(user_top_tags$artistID))

## [1] 815

Now ensure that artistID / tagID pairs are summarized by counting the number of times any given tag has been applied to an artist:

user.topTags <- summarise(group_by(user_top_tags, artistID, tagID ),
                       Count = length(tagID) )
length(unique(user.topTags$artistID))

## [1] 815

So we have a mismatch: 3 of the 818 artists we retained from the user-artists data have not been genre tagged by any user via the top 200 genre tags

Which 3?

not_tagged <- subset(top_818, !(artistID %in% user_top_tags$artistID))
not_tagged

## # A tibble: 3 x 2
##   artistID TotalUsers
##      <int>      <int>
## 1     1627         19
## 2     3378         18
## 3     4350         16

not_tagged$artistID %in% users.tags$artistID

## [1] TRUE TRUE TRUE

Remove 3 non-tagged artists from top_818

top_815 <- subset(top_818, artistID %in% user_top_tags$artistID)
length(unique(top_815$artistID))

## [1] 815

Remove artists not tagged using top 200 tags from user-artists data

lastfm.ua <- subset(lastfm.ua, artistID %in% top_815$artistID)
length(unique(lastfm.ua$userID))

## [1] 1870

1870 users retained - same as with top 818 artists

4.1 Create User-Artist Matrix

One of our goals for this project include the use of a user-based collaborative filter for purposes of generating recommendations of musical artists to last.fm users, we need to convert our reduced user-artists data to a user-item matrix

# convert to wide format
lastfm_mat <- spread(lastfm.ua, artistID, weight)
user_ids <- as.vector(lastfm_mat$userID)
lastfm_rating_mat <- as.matrix(lastfm_mat[,2:ncol(lastfm_mat)])

Calculate density / sparsity

nratings <- length(as.vector(lastfm_rating_mat))
nratings

## [1] 1524050

sum(!is.na(as.vector(lastfm_rating_mat)) ) / 
1 - sum(!is.na(as.vector(lastfm_rating_mat)) ) / nratings

## [1] 51364.97

4.2 Create Artist - Genre Matrix

For content-based recommendation efforts for this project, we need to create artist-genre matrix and provide similar artist recommendations to users.

# convert to wide format
tmp_mat <- spread(user.topTags, tagID, Count)
ag_artistID <- as.vector(tmp_mat$artistID)
art.genre.mat <- as.matrix(tmp_mat[,2:ncol(tmp_mat)])

Validating the content of artist-genre matrix against original user-tags data

nrow(subset(users.tags, artistID == 89 & tagID == 24))

## [1] 99

nrow(subset(users.tags, artistID == 72 & tagID == 18))

## [1] 68

Calculate density / sparsity

ntags <- length(as.vector(art.genre.mat))
ntags

## [1] 163000

sum(!is.na(as.vector(art.genre.mat)) ) / ntags

## [1] 0.09107975

1 - sum(!is.na(as.vector(art.genre.mat)) ) / ntags

## [1] 0.9089202

The artist-genre matrix can be binarized as follows:

bin_art.gen.mat <- art.genre.mat
bin_art.gen.mat[,][is.na(bin_art.gen.mat[,])] <- 0
bin_art.gen.mat[,][bin_art.gen.mat[,] > 0] <- 1

The user-item matrix can be binarized as follows:

bin.ua.rating.mat <- lastfm_rating_mat
bin.ua.rating.mat[,][is.na(bin.ua.rating.mat[,])] <- 0
bin.ua.rating.mat[,][bin.ua.rating.mat[,] > 0] <- 1

5 Building a User-Based Collaborative Filter

userArtist_binaryMat <- as(bin.ua.rating.mat,"binaryRatingMatrix")
e_bin <- evaluationScheme(userArtist_binaryMat, method="split", train=0.8, given = 1, goodRating = 1)
n_recommended <- 10
bin_rec <- Recommender(getData(e_bin, "train"), "UBCF", 
                          parameter = list(method = "Jaccard"))
bin_pred <- predict(bin_rec, getData(e_bin, "known"), n = n_recommended, goodRating = 1)
error_b <- calcPredictionAccuracy(bin_pred, getData(e_bin, "unknown"), 
                                  given = n_recommended, goodRating = 1)
kable(error_b, caption = "Performance Metrics")

Performance Metrics
	x
TP	3.1443850
FP	6.8556150
FN	22.9518717
TN	772.0481283
precision	0.3144385
recall	0.1313149
TPR	0.1313149
FPR	0.0087814

The performance metrics for the model are shown above. As we can see, the precision and recall are both relatively low, an also the true positive rate.

bin_pred@items[1:4]

## [[1]]
##  [1] 254 169 184 129 128 229 140  37 172 244
## 
## [[2]]
##  [1] 142 128 131  37 129 132 135 169 140 338
## 
## [[3]]
##  [1]  78  47 524  99 518 104 244  71 267 254
## 
## [[4]]
##  [1] 128 129  37 140 132  23 416 561 418  13

n_recommended <- 20
bin_pred <- predict(bin_rec, userArtist_binaryMat, n = n_recommended, goodRating = 1)
bin_pred@items[20:23]

## $`20`
##  [1]  78 184 658 427 256 343 369 484 428 461 641 545 370 429  71 643 534
## [18] 431 509 671
## 
## $`21`
##  [1] 599  47 105 311 295 519 342 216  17 349 579 653 524 450  90 206 463
## [18]  84  59 453
## 
## $`22`
##  [1] 132 338 145 272 173 139 233 229 270 121 134  40 289 165 172 161 505
## [18] 174 420 133
## 
## $`23`
##  [1] 117 139 145 131 140 169 233 127 270 141 156 272 229 168 232 172 138
## [18] 125  51 645

6 Recommend Artists for a Given User

#create a data frame to hold 10 recommendations for each artist
u_tenRec <- data.frame(matrix(ncol = 11, nrow = length(user_ids)))
u_tenRec[,1] <- user_ids
colnames(u_tenRec) <- c("userID", "r1", "r2", "r3", "r4", "r5",
                            "r6", "r7", "r8", "r9", "r10")
# load the recommendations
for (i in 1:length(bin_pred@items)){
  tmp_recs <- as.vector(bin_pred@items[[i]])
  num_trecs <- length(tmp_recs)
  user_arts <- unique(subset(lastfm.ua, userID == user_ids[i])$artistID)
  new_recs <- tmp_recs[!(tmp_recs %in% user_arts) ]
  num_newrecs <- length(new_recs)
  if(num_newrecs < 10) {
    new_recs <- sample(top_815$artistID[!(top_815$artistID %in% user_arts)], 10)
  }
  
  if (num_newrecs < 13) {
    topten <- new_recs[1:10]
  } else {
    t_seven <- sample(new_recs[1:10], 7)
    t_three <- sample(new_recs[11:length(new_recs)], 3)
    topten <- c(t_seven, t_three)
  } 
  topten <- sample(topten, 10)
  u_tenRec[i,2:11 ] <- topten
  
}

# randomly select a user
user <- sample(user_ids, 1)
urecs <- sort(as.vector(subset(u_tenRec, userID == user)[2:11]) )
rec_names <- subset(artists, id %in% urecs)$name
kable(rec_names, col.names = "Artists You Might Like")

Artists You Might Like
Spandau Ballet
Katie Melua
CAKE
Andrew Bird
Babyshambles
Katy Perry
Monrose
Aaron Carter
Renato Carosone
at

7 Recommend Similar Artists via the Artist-Genre Matrix

Create a similarity matrix using cosine distance as metric of similarity

# calculate artist similarity matrix
artist_sim <- similarity(as(bin_art.gen.mat, "binaryRatingMatrix"), method = "cosine",
                     which = "users")
artist_sim <- as(artist_sim, "matrix")
artist_sim[][] <- round(artist_sim[][],3)
colnames(artist_sim) <- ag_artistID
rownames(artist_sim) <- ag_artistID

image(as.matrix(artist_sim), main = "Artists similarity")

n_recommended <- 5
  artist <- sample(ag_artistID, 1)
a_name <- artists[artists$id == artist,]$name
arecs <- sort(artist_sim[as.character(artist),], decreasing = TRUE)[1:n_recommended]
arecs_IDs <- as.numeric(names(arecs))
arec_names <- artists[artists$id %in% arecs_IDs,]$name
table_head <- sprintf("Artists Similar to %s", a_name)
kable(arec_names, col.names = table_head)

Artists Similar to Cradle of Filth
Dimmu Borgir
Paradise Lost
Epica
Deathstars
Burzum

8 Conclusion

This was a wonderful dataset to build recommendation system driven by collaborative user based filtering and content based model technques. By binarizing the data the model did work better and with the low precision and recall rate and also true positive stayed low. Also, made recommendation for a given user and similar artists successfully. In future I would like to build an intractive system using Shiny to domonstrate the implemented features.

DATA643_FinalProject

Niteen Kumar

July 16, 2018