For this final Data 607 project, I build two recommender models using the recommenderlab
R library:
The data for this project comes from Last.fm user listening history, and was accessed using the Last.fm API.
In the sections below, I perform the following procedures:
The scripts used to access the Last.fm API data required 5+ hours to execute. For faster execution of this R Markdown document, I saved the data files from my API-related work to my Github account.
To run the remaining code in this document, please follow these steps:
Download and unzip the zip file, csv_files.zip
,from my Github account. Save to your working directory.
loadlastfm.sql
file, and open in MySQL. The scripts in this file create the database schema and load the csv files. Make sure to change the directories referenced in the SQL document to your working directory before running.Keep your connection to MySQL server open and run the remaining code in this document.
Here are example scripts to retrieve Github documents:
# set to your working directory
setwd("C:/Users/Aaron/Google Drive/Documents/cuny/Data607/FinalProject/new2")
# url for csv files
myurl <-"https://raw.githubusercontent.com/spitakiss/Data607Final/master/csv_files.zip"
# url for sql document
myurl2 <- "https://raw.githubusercontent.com/spitakiss/Data607Final/master/loadlastfm.sql"
# download and unzip files
download.file(myurl,destfile='csv_files.zip')
download.file(myurl2, destfile = 'loadlastfm.sql')
unzip('csv_files.zip')
Note: The package RLastFM
must be manually downloaded and installed. Scripts are provided below to carryout this procedure.
if (!require(XML)) {install.packages('XML');require(XML)}
if (!require(wordcloud)) {install.packages('wordcloud');require(wordcloud)}
if (!require(ggplot2)) {install.packages('ggplot2');require(ggplot2)}
if (!require(rworldmap)) {install.packages('rworldmap');require(rworldmap)}
if (!require(RCurl)) {install.packages('RCurl');require(RCurl)}
if (!require(knitr)) {install.packages('knitr');require(knitr)}
if (!require(recommenderlab)) {install.packages('recommenderlab');require(recommenderlab)}
if (!require(jsonlite)) {install.packages('jsonlite');require(jsonlite)}
if (!require(dplyr)) {install.packages('dplyr');require(dplyr)}
if (!require(tidyr)) {install.packages('tidyr');require(tidyr)}
if (!require(RMySQL)) {install.packages('RMySQL');require(RMySQL)}
if (!require(getPass)) {install.packages('getPass');require(getPass)}
if (!require(gridExtra)) {install.packages('gridExtra');require(gridExtra)}
# manual install of RLastFM package
if (!require(RLastFM)) {
download.file('http://cran.r-project.org/src/contrib/Archive/RLastFM/RLastFM_0.1-5.tar.gz','RLastFM_0.1-5.tar.gz')
install.packages('RLastFM_0.1-5.tar.gz', repos = NULL, type = "source")
require(RLastFM)
}
In this section, I retrieve relevant Last.fm data using functions from the RLastFM
package, where possible. The RLastFM
library was built to connect with a previous build of the Last.fm API; so some of the package functions no longer work properly. Also, the current build of the Last.fm API has new methods that cannot be accessed using RLastFM
. Finally, the available functions in this R package often do not allow the user full access to optional API method parameters. For these reasons, I occasionally used custom functions (in concert with the jsonlite
package) to access the API data.
In the prior build of the Last.fm API, there was an available method to retrieve Last.fm usernames associated with a particular artist. Unfortunately, this method is no longer available. To generate a list of usernames, I visited the Last.fm artist webpage for a musician that I currently follow, Andy Shauf. The artist’s page lists the usernames of top listeners. Using just this username, I generated additional Last.fm usernames using the user.getFriends
method. A fairly small loop can then be used to generate a sizable vector of usernames.
# api key
my_key <- "1ba315d4d1673bbf88aed473f1917306"
# seed user name, based on manual review of artist, Andy Shauf's Last.fm webpage.
user_nm <- c("nellsie")
# function to append unique user names to maste vector. This function implements the
# user.getFriends() method from RLastFM package.
append_friends <- function (x,y){
friends <- user.getFriends(x,key=my_key)$username
for (i in friends){
if (!(i %in% y)){
y <- append(y,i)
}
}
return(y)
}
# loop through friends, friends of friends, etc.
for (i in 1:50){
user_nm <- append_friends(user_nm[i],user_nm)
Sys.sleep(0.2)
}
# create user data frame with unique user_id
user_id <- seq(1:length(user_nm))
user_df <- data.frame(cbind(user_id,user_nm),stringsAsFactors=FALSE)
Now, pull in relevant user demographics, and merge with prior user data frame.
# getInfo() function; not available in RLastFM package
user.getInfo <- function(user, key=my_key){
base_url <- "http://ws.audioscrobbler.com/2.0/?method="
method <- "user.getinfo"
user_string <- "&user="
key_string <- "&api_key="
format_string <- "&format=json"
return (fromJSON(paste0(base_url,method, user_string,user,key_string,key,format_string)))
}
# data frame of demo data for all users
demo_data <- data.frame(user_nm=character(),country = character(), ct = integer())
for (name in user_df$user_nm){
temp <- user.getInfo(name)
temp_nm <- user_df$user_nm[which(user_df$user_nm==name)]
temp_country <- temp$user$country
temp_ct <- temp$user$playcount
temp_col <- cbind(user_nm = temp_nm,country=temp_country, ct = temp_ct)
demo_data <- rbind(demo_data, temp_col)
Sys.sleep(0.2)
}
# merge demo data with initial user data frame
user_df <- inner_join(user_df,demo_data, by="user_nm")
# make sure all user names are UTF-8 compliant
user_df$user_nm <- iconv(user_df$user_nm, "UTF-8", sub='')
In the following scripts, I pull artist information, such as artist name.
I also generate a data frame with the top 100 artists listened to by each user for the last twelve months.
# function to retrive users' top artists
# Using custom function rather than RLastFM functon to obtain more refined results
user.getTopArtistsMod <- function(user, key=my_key, period="12month", limit=100, page=1){
base_url <- "http://ws.audioscrobbler.com/2.0/?method="
method <- "user.gettopartists"
user_string <- "&user="
key_string <- "&api_key="
format_string <- "&format=json"
period_string <- "&period="
limit_string <- "&limit="
page_string <- "&page="
return (fromJSON(paste0(base_url,method, user_string,user,key_string,key,format_string,
period_string, period,limit_string,limit,page_string,page)))
}
# obtain user/top artist combinations
user_artist <- data.frame(name = character(), playcount = integer(),mbid= character(), user_id = integer())
for (name in user_df$user_nm){
temp <- user.getTopArtistsMod(name)$topartists$artist[c("name","playcount","mbid")]
temp$user_id <- user_df$user_id[user_df$user_nm == name]
user_artist <- rbind(user_artist,temp)
Sys.sleep(0.2)
}
# artists data frame
artists <- unique(user_artist[c("name","mbid")])
artist_id <- seq(1:nrow(artists))
artists <- cbind(artists, artist_id)
artists <- subset(artists,select=c(artist_id,name,mbid))
# clean up user_artist dataframe for database export
user_artist <-inner_join(user_artist, artists, by="name")
user_artist <- subset(user_artist, select=c(user_id,artist_id,playcount))
# getTopArtists API method sometimes (pretty rare) returns same artist twice for one user.
# Code below cleans up duplicates
user_artist <- user_artist %>%
group_by(user_id, artist_id) %>%
summarise(playcount = sum(playcount))
# Make sure all artist names are UTF-8
artists$name <- iconv(artists$name, "UTF-8", sub='')
Artist tags are user-assigned descriptions of artists, e.g. “grunge”, “1980s”, or “krautrock.”
# artist/tag combination data frame
artist_tag <- data.frame(artist_id = integer(),tag = character(),count = integer() )
# part 1: populate artist_tags data frame by matching on unique artist mbid (musicbrainz id), where available
for (i in 1:nrow(artists)){
if (artists[i,]$mbid != ""){
temp <- tryCatch(artist.getTopTags(artist=NA, mbid=artists[i,]$mbid),error=function(e) NULL)
temp_tag <- temp$tag
temp_ct <- temp$count
temp_id <- rep(artists[i,]$artist_id,length(temp_tag))
temp_col <- data.frame( artist_id= temp_id, tag=temp_tag,count=temp_ct)
artist_tag <- rbind(temp_col, artist_tag)
Sys.sleep(0.2)
}
}
# unique artist_ids from part 1
artist_unq <- unique(artist_tag$artist_id)
# part 2: populate artist_tags by matching on artist name if no mbid or no match from previous step
for (i in 1:nrow(artists)){
if (!(artists[i,]$artist_id %in% artist_unq)){
temp <- tryCatch(artist.getTopTags(artist=artists[i,]$name),error=function(e) NULL)
temp_tag <- temp$tag
temp_ct <- temp$count
temp_id <- rep(artists[i,]$artist_id,length(temp_tag))
temp_col <- data.frame( artist_id= temp_id, tag=temp_tag,count=temp_ct)
artist_tag <- rbind(temp_col, artist_tag)
Sys.sleep(0.2)
}
}
# clean up artist_tag table
artist_tag$tag <- tolower(artist_tag$tag)
artist_tag <- artist_tag %>%
group_by(artist_id,tag) %>%
summarise(count = sum(count))
artist_tag <- data.frame(artist_tag)
# unique tags data frame
tags <- artist_tag %>%
group_by(tag) %>%
summarise(count = sum(count))
tag_id <- seq(1:nrow(tags))
tags$tag_id <- tag_id
tags <- subset(tags, select=c(tag_id,tag,count))
tags <- data.frame(tags)
# prepare artist_tags data frame for database export
artist_tag <- inner_join(artist_tag,tags, by="tag")
artist_tag <- subset(artist_tag, select=c(artist_id,tag_id,count.x))
names(artist_tag) <- c("artist_id","tag_id","count")
# make sure all tag names are UTF-8 compliant
tags$tag <- iconv(tags$tag, "UTF-8", sub='')
Let’s pull in the relevant data from our MySQL database:
# MySQL connection. Using getPass() function to prompt user for username and password
my_db = dbConnect(MySQL(), user=getPass(msg = 'Enter username'), password= getPass('Enter Password'),
dbname='lastfm_db')
# pull in user country data
users <- dbGetQuery(my_db, "SELECT country, COUNT(*) AS country_ct
FROM users
GROUP BY country
ORDER BY COUNT(*) DESC;")
# retrieve most popular artist descriptive tags
tags <- dbGetQuery(my_db, "SELECT tag_nm, ttl_tag_ct FROM tags
ORDER BY ttl_tag_ct DESC
LIMIT 200;")
# pull user and artist playcount information, pull in artist name
user_artist <- dbGetQuery(my_db, "SELECT user_id, artist_nm, user_artist.artist_id, artist_ct
FROM user_artist
INNER JOIN artists
ON user_artist.artist_id = artists.artist_id;")
#user_artist <- dbGetQuery(my_db, "SELECT user_id, artist_id, artist_ct
# FROM user_artist;")
artists <- dbGetQuery(my_db, "SELECT artist_id, artist_nm
FROM artists;")
# db disconnect
dbDisconnect(my_db)
[1] TRUE
Let’s begin by summarizing the number of Last.fm listeners in our data set by country:
# plot top 10 countries in data set, by numer of users, horizontal bar char
ggplot(top_n(users,10,country_ct) , aes(x=reorder(country, country_ct), y=country_ct, fill=country))+
geom_bar(stat = "identity")+
coord_flip() + labs(aes(x="country", y="user ct", colour=cond))+
ggtitle("Listeners by Country")
# plot on world map
mapDevice('x11')
map_data <- joinCountryData2Map(users, joinCode="NAME", nameJoinColumn="country")
73 codes from your data successfully matched countries in the map
3 codes from your data failed to match with a country code in the map
170 codes from the map weren't represented in your data
mapCountryData(map_data, nameColumnToPlot="country_ct", catMethod=c(1:5,6:25,26:50,51:150,151:300,301:507),
colourPalette = "terrain",mapTitle="Map of Last.fm Listeners by Country")
agg_ct_ua <- user_artist %>%
group_by(artist_nm) %>%
count(artist_nm) %>%
arrange(desc(n)) %>%
top_n(5, n)
kable(agg_ct_ua)
artist_nm | n |
---|---|
Radiohead | 442 |
David Bowie | 420 |
Kanye West | 378 |
The Beatles | 300 |
Kendrick Lamar | 280 |
agg_sum_ua <- user_artist %>%
group_by(artist_nm) %>%
summarise(playct = sum(artist_ct)) %>%
arrange(desc(playct)) %>%
top_n(5, playct)
kable(agg_sum_ua)
artist_nm | playct |
---|---|
Goldfrapp | 693946 |
Sia | 244334 |
Rinrse | 90664 |
Boards of Canada | 69833 |
Kanye West | 69003 |
In this project, I am using total number of artist play counts per user as an implicit rating variable. Specifically, I am tracking each user’s Last.fm history from the most recent 12 months.
Let’s look at summary description of the play count distribution per user/artist combination:
summary(user_artist$artist_ct)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.0 8.0 22.0 68.2 57.0 693200.0
Clearly, the distribution is incredibly right-skewed. This is problematic, as our recommender models assume ratings per user are at least approximately normal. One possible solution is to take the log transformation of the play counts. Here is summary detail of the transformed data:
log_playct <- log(as.numeric(user_artist$artist_ct))
summary(log_playct)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 2.079 3.091 3.004 4.043 13.450
Now, let’s look at a before and after visual depiction of the log transformation:
before <- qplot(user_artist$artist_ct, xlab="play ct")
after <- qplot(log_playct, xlab = "log play ct")
grid.arrange(before, after, ncol=2)
The transformed data are slightly skewed, but the new distribution is a vast improvement over the raw play counts. We will proceed without further modifications.
In the code below, I massage the data for use in the recommenderlab
package. I then output the dimensions (users x artists of the ratings matrix).
# pair down user_artist df, and prepare in matrix wide format
ua_spread <- user_artist %>%
mutate(logplay = log(artist_ct)) %>%
select(user_id, artist_id, logplay) %>%
spread(artist_id, logplay)
matrix_wide <- as.matrix(ua_spread[, 2:ncol(ua_spread)])
rownames(matrix_wide) <- ua_spread$user_id
# create realRatingMatrix, a data format used by recommenderlab
ratings_matrix <- as(matrix_wide, "realRatingMatrix")
ratings_matrix
1223 x 29479 rating matrix of class 'realRatingMatrix' with 106713 ratings.
In the plots below, we see that the number of listeners per artist, in many cases, is extremely low. On the hand, most users listened to 100 artists, which was the maximum number pulled per user during the initial data pull using the Last.fm API.
# number artists listened to by users
users_per_artist <- colCounts(ratings_matrix)
upa_plot <- qplot(users_per_artist)
# number of artists per listener
artists_per_user <- rowCounts(ratings_matrix)
apu_plot <- qplot(artists_per_user)
grid.arrange(upa_plot, apu_plot, ncol=2)
Now let’s examine the first 100 rows and 100 columns of the ratings matrix:
image(ratings_matrix[1:100, 1:100], main = "Visualization of Ratings Matrix")
For faster computation and removal of possible bias in our models, I will pair down the ratings matrix as follows:
Note: There are more sophisticated, algorithmic techniques for removing bias, but they are beyond the scope of this project.
ratings_matrix_redux <- ratings_matrix[,colCounts(ratings_matrix) >= 10]
ratings_matrix_redux <- ratings_matrix_redux[rowCounts(ratings_matrix_redux) >= 20]
ratings_matrix_redux
1016 x 1880 rating matrix of class 'realRatingMatrix' with 56406 ratings.
set.seed(1)
train_boolean <- sample(x = c(TRUE,FALSE), size = nrow(ratings_matrix_redux),replace=TRUE, prob = c(0.8,0.2))
training_data <- ratings_matrix_redux[train_boolean,]
test_data <- ratings_matrix_redux[!train_boolean,]
Training data dimensions:
training_data
813 x 1880 rating matrix of class 'realRatingMatrix' with 44914 ratings.
Test data dimensions:
test_data
203 x 1880 rating matrix of class 'realRatingMatrix' with 11492 ratings.
Fit model:
recmodel_ibcf <- Recommender(data=training_data, method="IBCF", parameter = list(k=30))
recmodel_ibcf
Recommender of type 'IBCF' for 'realRatingMatrix'
learned using 813 users.
Find artists that are similar to many other artists:
model_details <- getModel(recmodel_ibcf)
col_sums <- colSums(model_details$sim > 0)
# top 5 artists with highest number of similarities to other artists
top_artists <- head(sort(col_sums, decreasing = TRUE),5)
# display top 5
display_top <- subset(artists, artist_id %in% names(top_artists), select=c(artist_nm))
row.names(display_top) <- NULL
display_top
artist_nm
1 Maroon 5
2 Manic Street Preachers
3 Lily Allen
4 Jimi Hendrix
5 <U+690D><U+677E><U+4F38><U+592B>
Find most recommended items in the test data set:
# recommendations per user
num_rec <- 5
# predict
rec_predict <- predict(object = recmodel_ibcf, newdata = test_data, n = num_rec)
rec_predict
Recommendations as 'topNList' with n = 5 for 203 users.
# define recommendation vector
rec_vector <- integer()
for (i in 1:length(rec_predict)){
rec_vector <- append(rec_vector, rec_predict@items[[i]])
}
# most recommended artists
most_rec <- head(sort(table(rec_vector), decreasing=TRUE),5)
display_top <- subset(artists, artist_id %in% names(most_rec), select = c(artist_nm))
row.names(display_top) <- NULL
display_top
artist_nm
1 Beach House
2 Kasabian
3 Connie Kleppe
4 Tindersticks
5 The Moon Lay Hidden Beneath a Cloud
Fit model:
recmodel_ubcf <- Recommender(data=training_data, method="UBCF")
recmodel_ubcf
Recommender of type 'UBCF' for 'realRatingMatrix'
learned using 813 users.
Find most recommended items in the test data set:
# recommendations per user
num_rec <- 5
# predict
rec_predict <- predict(object = recmodel_ubcf, newdata = test_data, n = num_rec)
rec_predict
Recommendations as 'topNList' with n = 5 for 203 users.
# define recommendation vector
rec_vector <- integer()
for (i in 1:length(rec_predict)){
rec_vector <- append(rec_vector, rec_predict@items[[i]])
}
# most recommended artists
most_rec <- head(sort(table(rec_vector), decreasing=TRUE),5)
display_top <- subset(artists, artist_id %in% names(most_rec), select = c(artist_nm))
row.names(display_top) <- NULL
display_top
artist_nm
1 Moneybrother
2 Elohim
3 Tom Waits
4 Leonard Cohen
5 Roma Amor
Set up evaluation model:
# k folds
n_fold <- 4
# must be lower than 20, our minimun specified number of artists per user
items_to_keep <- 15
# set ratings threshold to mean of user ratings, somewhat arbitrary
rating_threshold <- mean(rowMeans(ratings_matrix_redux))
eval_sets <- evaluationScheme(data = ratings_matrix_redux, method = "cross-validation",
k = n_fold, given = items_to_keep, goodRating = rating_threshold)
Evaluate the accuracy of the model using:
eval_reco <- Recommender(data = getData(eval_sets, "train"), method = "IBCF", parameter = NULL)
# recommend 5 items
items_to_rec <- 5
eval_pred <- predict(object = eval_reco, newdata = getData(eval_sets, "known"), n = items_to_rec,
type = "ratings")
# determine model accuracy
eval_accuracy <- calcPredictionAccuracy(x = eval_pred, data=getData(eval_sets,"unknown"), byUser=FALSE)
eval_accuracy
RMSE MSE MAE
1.2618898 1.5923658 0.9333358
Use the same measures to evaluate the UBCF model:
eval_reco <- Recommender(data = getData(eval_sets, "train"), method = "UBCF", parameter = NULL)
# recommend 5 items
items_to_rec <- 5
eval_pred <- predict(object = eval_reco, newdata = getData(eval_sets, "known"), n = items_to_rec,
type = "ratings")
# determine model accuracy
eval_accuracy <- calcPredictionAccuracy(x = eval_pred, data=getData(eval_sets,"unknown"), byUser=FALSE)
eval_accuracy
RMSE MSE MAE
0.9416992 0.8867975 0.7255586
We see that the UBCF model outperforms the IBCF model for each of the three measures.
Finally, I will wrap up this project by recommending artists to myself, using my personal, Last.fm data:
In the scripts below, I retrieve my listening history, and store in the proper realRatingsMatrix format. Also, I output my top five artists in terms of play counts from the last 12 months.
# pull information using Last.fm API
my_key <- "1ba315d4d1673bbf88aed473f1917306"
user.getTopArtistsMod <- function(user, key=my_key, period="12month", limit=100, page=1){
base_url <- "http://ws.audioscrobbler.com/2.0/?method="
method <- "user.gettopartists"
user_string <- "&user="
key_string <- "&api_key="
format_string <- "&format=json"
period_string <- "&period="
limit_string <- "&limit="
page_string <- "&page="
return (fromJSON(paste0(base_url,method, user_string,user,key_string,key,format_string,
period_string, period,limit_string,limit,page_string,page)))
}
# massage data into appropriate form for realRatingsmatrix
my_artists <- user.getTopArtistsMod('spitakiss')$topartists$artist[c("name","playcount","mbid")]
my_artists$playcount <- as.integer(my_artists$playcount)
names(my_artists) <- c("artist_nm","playcount","mbid")
# display my top artists in last 12 months:
my_artists %>%
select(artist_nm,playcount) %>%
arrange(desc(playcount)) %>%
top_n(5,playcount)
artist_nm playcount
1 Andy Shauf 58
2 Caribou 38
3 Stephen Steinbrink 36
4 The Posies 27
5 Palace Winter 23
# perform additonal manipulations to get data in proper matrix format
my_artists <- my_artists %>%
inner_join(artists, by="artist_nm") %>%
select(artist_id, playcount)
rating_columns <- data.frame(as.integer(colnames(ratings_matrix_redux)))
names(rating_columns) <- "artist_id"
my_artists <- left_join(rating_columns, my_artists, by="artist_id")
my_artists <- my_artists %>%
spread(artist_id,playcount)
matrix_wide <- as.matrix(my_artists)
my_ratings_matrix <- as(matrix_wide, "realRatingMatrix")
Predictions from IBCF model:
rec_predict <- predict(object = recmodel_ibcf, newdata = my_ratings_matrix, n = num_rec)
rec_vector <- rec_predict@items[[1]]
rec_vector <- integer()
for (i in 1:length(rec_predict)){
rec_vector <- append(rec_vector, rec_predict@items[[i]])
}
# my recommended artists, IBCF
most_rec <- sort(table(rec_vector), decreasing=TRUE)
display_top <- subset(artists, artist_id %in% names(most_rec), select=artist_nm)
rownames(display_top) <- NULL
display_top
artist_nm
1 Lykke Li
2 Angel Olsen
3 Tyler, the Creator
4 Run the Jewels
5 Edith Sderstrm
Predictions from UBCF model:
rec_predict <- predict(object = recmodel_ubcf, newdata = my_ratings_matrix, n = num_rec)
rec_vector <- rec_predict@items[[1]]
rec_vector <- integer()
for (i in 1:length(rec_predict)){
rec_vector <- append(rec_vector, rec_predict@items[[i]])
}
# my recommended artists, UBCF
most_rec <- sort(table(rec_vector), decreasing=TRUE)
display_top <- subset(artists, artist_id %in% names(most_rec), select=artist_nm)
rownames(display_top) <- NULL
display_top
artist_nm
1 Lykke Li
2 Angel Olsen
3 Elliott Smith
4 Tyler, the Creator
5 Leonard Cohen
RLastFM
package and how to implement a variety of its built-in functions.recommenderlab
package. I relied on this book heavily for both fitting and evaluating the two recommender models.