library(dplyr)
library(magrittr)
library(stringr)
library(tidyr)
library(knitr)
library(kableExtra)
library(ggplot2)
library(devtools)
# devtools::install_github("nicolewhite/RNeo4j")
library(RNeo4j)
library(recommenderlab)
library(psych)
library(rstudioapi)
library(knitr)
library(kableExtra)

1 | OVERVIEW

A. Purpose

In this project, we created a song recommender system based on nearest-neighbor collaborative filtering using the Million Songs Database. We trained and evaluated our data source in Neo4j, a graph database platform, and (as a stretch goal) built out a proof-of-concept user interface in RShiny.

B. Team

Kavya Beheraj, GitHub
Jeremy O’Brien, GitHub

C. Background

The Million Song Database, or MSD, is an open source dataset of one million popular songs made freely available by The Echo Nest.

The Echo Nest is a music intelligence and data platform chartered to understand the audio and textual content of recorded music. It was spun off from MIT Media Lab around 2005 and acquired by Spotify in 2014 to power playlist creation.

In addition to a wealth of other song metadata, the dataset also charts the intersection of unique listeners with specific songs, providing a view into people’s listening habits.

The MSD data was collected using The Echo Nest API and musicbrainz, an open music encyclopedia. The MSD FAQ indicates that data was “downloaded during December 2010” but does not provide detail on the span of time, platforms, or geographies over which data was collected. This presents some constraints on treating the dataset as a representative sample of listening behavior or song popularity.

The full dataset is available via AWS and has been mirrored by the Open Science Data Cloud.

D. Research Question

Our research question: How reliable is the proportion of times that a user listened to a song in predicting the strength of their preference for another song?

More generally: How reliable are implicit ratings in predicting user preference?

Each user in the MSD has a **listen count** variable that records the number of times they listened to a song. We can treat a user’s listen count of a song – compared to their total number of song listens – as the user’s implicit rating of the song. An implicit rating infers user preference from behavior, as opposed to an explicit rating, which is given directly by the user.

In this case, we are assuming that if a certain song takes up a greater share of a user’s total song listens, then the user prefers that song more.

* **For example:** Let's say that a user has two songs recorded in the MSD. They listened to the first song **5 times** and the second song **9 times**. The user's `rating` of the first song will be **5 / 14 = 0.36**, and the `rating` of the second song will be **9 / 14 = 0.64**. We can then assume that the user prefers the second song more than the first song.

Defining our rating system in this way allows us to compare the strength of preference for multiple users who have listened to the same song – the basis of collaborative filtering.

However, this approach has its downsides:

Users in the MSD tend to listen to a few songs very often, and the rest not very often at all. This introduces a “long tail” that skews the ratings lower.
Our research question reveals that we are also unsure about how much we can extrapolate of a user’s real song preference based on an implicit rating.

E. Approach

We applied the CRISP Data Mining Process to organize our workflow:

Business Understanding – Define our goal: to create a song recommender system based on nearest-neighbor collaborative filtering.
Data Understanding – Read in the data, conduct exploratory data analysis, and calculate summary statistics for the Million Songs Database.
Data Preparation – Calculate song ratings, clean the dataframe, and create a random sample for downstream analysis.
Modeling – Train and test the recommender in Neo4j: create nodes and relationships, find cosine similarities among users, and return a list of recommended songs.
- We initially explored using RecommenderLab to train and test the recommender, but ultimately decided to go with Neo4j (specifically, the package RNeo4j) since we liked how the graph database represented the relationship between users and the songs they listened to.
Evaluation – Evaluate the error rate of the recommender system.
- We defined “error” as the difference between the user’s real rating and the recommender’s predicted rating of a song.
Deployment – This was our stretch goal. We aimed to build a proof-of-concept (POC) user interface in Shiny that could provide real-time song recommendations for users.
- We ended up building a simple user interface in Shiny (another maiden voyage for us!) to make it easier to interact with the recommender. Again, as this is a proof of concept, we are not hosting the Shiny app on the web so the code in this RMD must be run in order to demonstrate the Shiny UX.

2 | INSTRUCTIONS

This project is a proof-of-concept (POC) and as such not all elements are turn-key. For ease of implementation and reproducibility, please follow the instructions below:

# Create subdirectory in working directory to house Shiny app
dir <- getwd()
dir.app <- (file.path(dir, "App"))
if (!dir.exists(dir.app)){
dir.create(dir.app)
  print(paste0("Shiny app directory created: ", dir.app))
} else {
    print("Shiny app directory already exists")
}

## [1] "Shiny app directory already exists"

Download the RMarkdown file and save it to your working directory.
Download the Shiny app.R file and save it you the App folder created in your working directory.
If not yet installed, download and install Neo4j desktop: https://neo4j.com/download/
- Installation instructions: https://neo4j.com/docs/operations-manual/current/installation/windows/
- Unofficial video tutorial: https://www.youtube.com/watch?v=ANLZsH52kBA
Open Neo4j desktop and create a new local database, noting the password you use.
Set “pw” below to that password.
This RMD file will open the Shiny app to demonstrate it; you can run the app directly, but it relies on variables in the global environment created by this RMD file.

3 | READ IN THE DATA

The MSD contains many datasets, however we focused on two: one that captured user listen counts, and the other that provided metadata on each song.

We found links to CSV files of both datasets in the Medium article “How to build a simple song recommender system” by Eric Le.

A. Read in data

# Read in the ratings dataframe and rename the columns
u1 <- "https://static.turi.com/datasets/millionsong/10000.txt"
df1 <- as.data.frame(read.table(u1, header = F, stringsAsFactors = F))
names(df1) <- c("user_id", "song_id", "listen_count")


# Read in the metadata dataframe
u2 <- "https://static.turi.com/datasets/millionsong/song_data.csv"
metadata <- as.data.frame(read.csv(u2, header = T, sep = ",", stringsAsFactors = F))

B. Join dataframes

# Join data by song ID. Remove duplicate song ratings.
joined <- distinct(inner_join(df1, metadata, by = "song_id"))


# Group and summarize joined dataframe by user ID
grouped_id <- joined %>%
  select(user_id, listen_count) %>%
  group_by(user_id) %>%
  summarise(number_songs = n(), 
            mean_listen_count = mean(listen_count), 
            sum_listen_count = sum(listen_count))

grouped_song <- joined %>% 
  select(song_id, title, artist_name) %>% 
  group_by(title)

4 | SUMMARIZE THE DATA

The MSD is a large dataset, so to better understand it we perform some exploratory data analysis, including summarization and visualization, on the data frames we’ve created: * The listener-level summary (grouped_id): total number of listens, total number of songs, and average listens per song, all for a given listener (user ID). * The song-level summary (grouped_song): song title, artist, and a unique song ID. * The detailed set of tidied observations (joined) keying song to listener: song title, artist, release, year, and a unique song ID; along with counts of listens by listener.

A. Listener-Level Summary Statistics

# High-level statistics on listeners
describe(grouped_id) %>% kable()

## Warning in describe(grouped_id): NAs introduced by coercion

## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
## Inf

## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
## -Inf

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
user_id*	1	76353	NaN	NA	NA	NaN	NA	Inf	-Inf	-Inf	NA	NA	NA
number_songs	2	76353	26.90836	32.508118	16.000000	20.478349	13.343400	1	718	717	4.197510	32.06971	0.1176464
mean_listen_count	3	76353	3.18767	3.687947	2.160919	2.510708	1.336801	1	192	191	9.260143	214.97427	0.0133466
sum_listen_count	4	76353	81.77375	120.098037	42.000000	57.298283	44.478000	1	4789	4788	5.718204	85.71880	0.4346332

The MSD includes 76,353 individuals, each of whom has listened to at least one song. The distribution of unique songs people listened to is skewed considerably to the right. On average, individuals listened to just over sixteen songs, though there many with playlists in the hundreds of songs. Over the data collection period, individuals averaged over 81 listens in total.

# Compare total songs and listeners
ggplot(data = grouped_id, aes(number_songs)) + 
  geom_histogram(binwidth = 1) +
  labs(title = "How people listen: songs vs. listeners", x = "Unique songs", y = "Total listeners")

This histogram – a frequency distribution of the number of songs individuals listen to (unique songs) – depicts the remarkable skew of the dataset, highlighting the large number of the 76,353 individuals who listened to less than several dozen songs and the very long tail of listeners who listened to many songs.

# Compare total songs and listeners below 100 songs
ggplot(data = grouped_id, aes(number_songs)) + 
  geom_histogram(breaks = seq(1, 100, by = 1)) +
  labs(title = "How people listen: songs vs. listeners", subtitle = "<100 songs (detail)", x = "Unique songs", y = "Total listeners")

This next histogram – the same frequency distribution – focuses more closely on listeners of 100 songs or less. Before reaching a peak around 10 songs, it reveals an uptick of single-song listeners at the left-most side of the graph. The “flat” listening patterns of single-song listeners do not provide variability helpful to understanding preference.

# Compare total songs and total listens
ggplot(data = grouped_id, aes(x = number_songs, y = sum_listen_count)) +
  geom_point() +
  geom_smooth(method = "loess", se = F) +
  xlim(c(0, 4000)) +
  ylim(c(0, 4000)) +
  labs(title = "How people listen: songs vs. listens", x = "Unique songs", y = "Total listens")

If we examine a scatterplot of listener behavior – with the total number of songs individuals listen to on the horizontal axis and the number of times overall individuals listen to songs on the vertical axis – we find a distribution that is taller than it is wide. That is, listeners tend to listen to fewer songs more times, exhibiting preferences. A locally weighted scatterplot smoothing line highlights this general tendency.

The scale of the chart reveals how this plays out on diverse spectra, with dense clouds of listeners at many levels. For example, while many individuals listened over 500 times to 100 songs, there are others who listened to a set of 100 songs around one time each.

B. Song-Level Summary Statistics

# Number of unique songs.
length(unique(joined$song_id))

## [1] 10000

# Earliest and latest recordings (correcting for null values coded as 0)
min(joined$year[which(joined$year > 0)])

## [1] 1954

max(joined$year[which(joined$year > 0)])

## [1] 2010

# Total number of listens
sum(joined$listen_count)

## [1] 6243671

# High-level statistics on songs
describe(joined$listen_count)

##    vars       n mean   sd median trimmed mad min  max range  skew kurtosis
## X1    1 2054534 3.04 6.58      1    1.85   0   1 2213  2212 38.15  7448.87
##    se
## X1  0

# Compare total listens and unique listeners
joined %>% 
  select(user_id, song_id, listen_count) %>% 
  group_by(song_id) %>% 
  summarise(total_listens = sum(listen_count), unique_listeners = n_distinct(user_id)) %>%
  ggplot(aes(x = total_listens, y = unique_listeners)) +
  geom_point() +
  geom_smooth(method = "loess", se = F) +
  xlim(c(0, 8000)) +
  ylim(c(0, 8000)) +
  labs(title = "How songs are listened to: unique songs vs. total listens", x = "Total listens", y = "Unique listeners")

True to its name, the MSD does indeed include a million songs (a good sense check!). While some songs are missing release years, the earliest recorded in the dataset is 1954 and the most recent is 2010 (the year of the data’s release). Those million songs were listened to on average over 3 times, and a locally weighted scatterplot smoothing line again highlights that songs tend to be listened to repeatedly by the same listeners (i.e. preference).

5 | PREPARE DATA FOR MODELING

A. Calculate ratings and filter dataframe

# Join total listen count to the full dataframe.
joined2 <- left_join(joined, grouped_id, by = "user_id")

# Create a new column to hold a calculated implicit rating (as a number from 0 to 100) of user preference for a song. 
joined_final <- mutate(joined2, rating = round((joined2$listen_count / joined2$sum_listen_count)*100, 2))

## Warning: package 'bindrcpp' was built under R version 3.4.3

# Filter out users with a single song rating. Include users who have a diverse set of ratings.
joined_final <- filter(joined_final, rating<100, mean_listen_count>2, number_songs>=15, year>0)

head(joined_final)  %>% 
  kable("html")     %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

user_id	song_id	listen_count	title	release	artist_name	year	number_songs	mean_listen_count	sum_listen_count	rating
969cc6fb74e076a68e36a04409cb9d3765757508	SOABRAB12A6D4F7AAF	2	A Pain That Im Used To	Playing The Angel	Depeche Mode	2005	19	3.31579	63	3.17
969cc6fb74e076a68e36a04409cb9d3765757508	SOAOQFD12A6D4FAAA9	1	Medicating	Trainwreck	Boys Night Out	2005	19	3.31579	63	1.59
969cc6fb74e076a68e36a04409cb9d3765757508	SOBFPJC12A58A7D1AB	9	Crystal Blue Persuasion	Anthology	Tommy James And The Shondells	1968	19	3.31579	63	14.29
969cc6fb74e076a68e36a04409cb9d3765757508	SOBZZDU12A6310D8A3	2	Don’t Dream It’s Over	Recurring Dream_ Best Of Crowded House (Domestic Only)	Crowded House	1986	19	3.31579	63	3.17
969cc6fb74e076a68e36a04409cb9d3765757508	SOERVXG12A6D4F5583	1	Besame Mucho	The Best of Andrea Bocelli - ‘Vivere’	Andrea Bocelli	2006	19	3.31579	63	1.59
969cc6fb74e076a68e36a04409cb9d3765757508	SOGFUFC12A8C13F1E5	6	It’s My Party	1960’s School Party Night	Lesley Gore	1963	19	3.31579	63	9.52

hist(joined_final$rating)

B. Sample the data

Now that we have cleaned data, we can prepare it for modeling by taking a random sample.

# Create a dataframe of unique user IDs. There are 75,491 users in the cleaned dataframe joined_final.

user_list <- distinct(as.data.frame(joined_final$user_id))
names(user_list) <- c("user_id")
n <- nrow(user_list)

s3_user <- sample(user_list$user_id, round(n*0.005), replace = F)
names(s3_user) <- c("user_id")
s3 <- distinct(subset(joined_final, joined_final$user_id %in% s3_user))
s3 <- as.data.frame(select(s3, user_id, song_id, rating, title, release, artist_name))

print(sprintf('The cleaned dataset contains %d users.', n))

## [1] "The cleaned dataset contains 24747 users."

print(sprintf('The sample contains %d users.', round(n*0.005)))

## [1] "The sample contains 124 users."

6 | BUILD NEO4J RECOMMENDER

We found the article “Movie Recommendations with k-Nearest Neighbors and Cosine Similarity” by Nicole White (who also created RNeo4j) invaluable for this section.

A. Connect to local Neo4j server

Prior to this step, download Neo4j and start the browser in a new graph database.

Then, connect to the graph database using your password.

# Connect to the graph
graph = startGraph("http://localhost:7474/db/data/", username="neo4j", password=pw)

# Clear the environment
clear_query <- "match (n) detach delete (n)"
cypher(graph, clear_query)

B. Create nodes and relationships

Run a query that takes the following actions for every line in our cleaned dataframe:

Create a user node – with the property id – for every user in the dataframe.
Create a song node – with the properties id, title, artist, and album – for every song in the dataframe.
Define the relationship between the user and song nodes as RATED, and define the rating.

# Query to create nodes and relationships
q1 <- "
      MERGE (user:User {id: {user_id}}) 
      MERGE (song:Song {song_id: {song_id}, title: {title}, artist: {artist}, album: {album}}) 
      CREATE (user)-[r:RATED {rating: {rating}}]->(song)
      SET r.rating = TOFLOAT({rating})
      "

# Start a new transaction
tx <- newTransaction(graph)

# Loop through every line in the sample dataframe using the query
for (i in 1:nrow(s3)) {
  row <- s3[i , ]
  appendCypher(tx, q1,
               user_id = row$user_id,
               song_id = row$song_id,
               title = row$title,
               rating = row$rating,
               artist = row$artist_name,
               album = row$release)
}

# Commit the transaction
commit(tx)

# Check that the relationship is correct
summary(graph)

##   This    To That
## 1 User RATED Song

C. Find similarities between users

Now that we have created the nodes and ratings, we can add a relationship that defines the cosine distance between user ratings as their SIMILARITY.

# Query to find cosine similarity between users, and define the relationship
q2 <- 
"MATCH (p1:User)-[x:RATED]->(s:Song)<-[y:RATED]-(p2:User)
WITH SUM(x.rating * y.rating) AS xyDotProduct,
 SQRT(REDUCE(xDot = 0.0, a IN COLLECT(x.rating) | xDot + a^2)) AS xLength,
 SQRT(REDUCE(yDot = 0.0, b IN COLLECT(y.rating) | yDot + b^2)) AS yLength,
 p1, p2

MERGE (p1)-[s:SIMILARITY]-(p2)

SET s.similarity = xyDotProduct / (xLength * yLength)"

cypher(graph, q2)

D. Examine graph database

Here is a small visual sample of our graph database.

The blue nodes are users.
The pink nodes are songs
There are two classes of relationships (lines) between them:
- A user’s rating of a song; and
- A user’s similarity to another user

E. Test recommender

To test the performance of the recommender on a small scale, we will take one random user ID and:

Get the user’s real song ratings.
Get recommended songs that the user has not already listened to based on similarity to other users.
Compare the real and predicted ratings of songs that both the user and recommender have rated.
Calculate error as the difference between the real and predicted rating.

Get the user’s real song ratings.

test_id <- sample(s3$user_id, 1)

q4 <- "
      MATCH (a:User {id:'%s'})-[r:RATED]->(m:Song)
      RETURN m.song_id AS song_id, m.title AS title, m.artist AS artist, r.rating AS rating
      "

test_ratings <- cypher(graph, sprintf(q4, test_id))

head(test_ratings) %>% 
    kable("html")  %>% 
    kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

song_id	title	artist	rating
SOADQPP12A67020C82	You And Me Jesus	Jake Hess	0.28
SODCZEL12A6D4FBAF9	Bottle Poppin’ (feat. Gorilla Zoe) (Explicit Album Version)	Yung Joc featuring Gorilla Zoe	1.14
SOBWJZE12AB018165F	Remedy	The Black Crowes	0.28
SODPKNB12AF72ACEBA	Glitter In The Air	P!nk	0.28
SODPERN12A8C142D04	Trojan Horse	Bloc Party	0.28
SOBCLWO12A6D4F83AD	Forever My Friend	Ray LaMontagne	0.28

Get recommended songs that the user has not already listened to based on similarity to other users.

Build the recommendations using a query that averages the top 3 song ratings of the user’s nearest neighbors. It will return songs that the user has not already rated in a dataframe called shiny_output. Our goal is to feed this dataframe into our Shiny app to display as an output.

reco <- "
    MATCH (b:User)-[r:RATED]->(m:Song), (b)-[s:SIMILARITY]-(a:User {id:'%s'})
    WHERE NOT((a)-[:RATED]->(m))
    WITH m, s.similarity AS similarity, r.rating AS rating
    ORDER BY m.title, similarity DESC
    WITH m.song_id AS song_id, COLLECT(rating)[0..3] AS ratings
    WITH song_id, REDUCE(s = 0, i IN ratings | s + i)*1.0 / LENGTH(ratings) AS reco
    ORDER BY reco DESC
    RETURN song_id AS song_id, reco AS recommendation
    "

shiny_output <- cypher(graph, sprintf(reco, test_id)) %>% 
  inner_join(grouped_song, by="song_id") %>%
          arrange(desc(recommendation)) %>% 
          distinct()

head(shiny_output) %>% 
    kable("html")  %>% 
    kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

song_id	recommendation	title	artist_name
SOBZZER12A6D4F7FA6	45.57	My Generation	Di-rect
SOOHNBB12A6D4F74F8	23.39	Shallow Be Thy Game (Album Version)	Red Hot Chili Peppers
SOBJKAC12AB017E1DC	21.33	Hot Boyz (w/ Dear Nora)	Casiotone For The Painfully Alone
SOCXBQG12AF72A7CBD	20.78	Arco Arena	Cake
SODFCYY12A58A78381	19.48	Ella Me Levanto	Daddy Yankee
SOSHQHA12A58A7B1E9	19.35	Video Phone	BeyoncÃ©

Compare the real and predicted ratings of songs that both the user and recommender have rated.

reco2 <- "
    MATCH (b:User)-[r:RATED]->(m:Song), (b)-[s:SIMILARITY]-(a:User {id:'%s'})
    WITH m, s.similarity AS similarity, r.rating AS rating
    ORDER BY m.title, similarity DESC
    WITH m.song_id AS song_id, COLLECT(rating)[0..3] AS ratings
    WITH song_id, REDUCE(s = 0, i IN ratings | s + i)*1.0 / LENGTH(ratings) AS reco
    ORDER BY reco DESC
    RETURN song_id AS song_id, reco AS recommendation
    "

shiny_output2 <- cypher(graph, sprintf(reco2, test_id)) %>% 
  inner_join(grouped_song, by="song_id") %>%
          arrange(desc(recommendation)) %>% 
          distinct()

eval2 <- inner_join(shiny_output2, test_ratings, by = "song_id") %>%
              select(song_id, title.x, artist_name, rating, recommendation)

names(eval2) <- c("song_id", "title", "artist", "user_rating", "predicted_rating")

head(eval2) %>% 
    kable("html")  %>% 
    kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

song_id	title	artist	user_rating	predicted_rating
SOOZQNS12A8C13B646	Come Let Go	Xavier Rudd	0.28	10.870000
SOHKFZG12AF72AAD27	Clint Eastwood	Gorillaz	1.42	9.380000
SOVEUVC12A6310EAF1	Just Lose It	Eminem	0.57	7.725000
SOLFXKT12AB017E3E0	Fireflies	Charttraxx Karaoke	0.85	7.190000
SOBONKR12A58A7A7E0	You’re The One	Dwight Yoakam	1.42	5.716667
SOAUWYT12A81C206F1	Undo	BjÃ¶rk	3.12	5.363333

Calculate error as the difference between the real and predicted rating.

test_error <- mean(eval2$predicted_rating) - mean(eval2$user_rating)
test_error

## [1] 0.9160784

7 | EVALUATE RECOMMENDER PERFORMANCE

We will now evaluate our recommender on a larger scale by repeating the 4 steps in part 6E above for a series of 10 random samples of 100 users from our full dataframe.

A. Take random samples and evaluate error

# Initialize a few variables
k <- 1  # Iterate through random samples
t <- 1  # Iterate through the user IDs within each sample
i <- 0  # Counter for random samples

dat1 <- data.frame()  # Empty dataframe to store individual sample error rate
dat2 <- data.frame()  # Empty dataframe to store the error rate of all samples

for (k in 1:10) {

  s3_users <- distinct(select(joined_final, user_id))              # Pull out a list of User IDs within our sample dataframe s3
  test_size <- 100                                           # Set size of each random sample
  s <- sample(joined_final$user_id, replace = F, test_size)  # Take a sample of the User IDs

  for (t in s[1:length(s)]) {

    # Get the real rating of each user within the sample s3
    q4 <- "
          MATCH (a:User {id:'%s'})-[r:RATED]->(m:Song)
          RETURN m.song_id AS song_id, m.title AS title, m.artist AS artist, r.rating AS rating
          "
    real_ratings <- cypher(graph, sprintf(q4, t))   # Run query

  
    # Get the predicted rating of each user in the sample s3, for songs that overlap
    q5 <- "
          MATCH (b:User)-[r:RATED]->(m:Song), (b)-[s:SIMILARITY]-(a:User {id:'%s'})
          WITH m, s.similarity AS similarity, r.rating AS rating
          ORDER BY m.title, similarity DESC
          WITH m.song_id AS song_id, COLLECT(rating)[0..3] AS ratings
          WITH song_id, REDUCE(s = 0, i IN ratings | s + i)*1.0 / LENGTH(ratings) AS reco
          ORDER BY reco DESC
          RETURN song_id AS song_id, reco AS recommendation
          "
    predicted_ratings <- cypher(graph, sprintf(q5, t))  # Run query
    
    # If the number of rows of the predicted ratings dataframe does not result in a NULL, move forward with the evaluation; otherwise, skip the User ID.
    if (is.null(nrow(predicted_ratings)) == "FALSE") {
      
      # Join the predicted ratings to `grouped_song` to get song information like arist and album.
      predicted_ratings <- predicted_ratings %>% inner_join(grouped_song, by = "song_id") %>%
                                                 arrange(desc(recommendation)) %>%
                                                 distinct()

    
      # Calculate the error rate between the real vs. predicted ratings as the predicted rating minus the real user rating for the same song.
      eval <- inner_join(predicted_ratings, real_ratings, by = "song_id") %>%
              select(song_id, title.x, artist_name, rating, recommendation)
    
      # Create a `final_eval` dataframe that contains song information, the real rating, the predicted rating, and the error rate.
      names(eval) <- c("song_id", "title", "artist", "user_rating", "predicted_rating")
      eval <- mutate(eval, error_rate = as.numeric((predicted_rating - user_rating), 2))
      final_eval <- select(eval, user_rating, predicted_rating, error_rate)
  
      # Skip the User ID if the number of rows results in a NULL dataframe.
      dat1 <- rbind(dat1, final_eval) } else if (is.null(nrow(predicted_ratings)) == "TRUE") {
        next
    }
  
  }

  # Set a counter to keep track of the random sample
  i <- as.numeric(i+1)

  # Gather the counter, the mean user rating, the test size, the mean predicted rating, and the mean error rate for the entire sample
  mean_error <- c(i, test_size, mean(dat1$user_rating), mean(dat1$predicted_rating), mean(dat1$error_rate))

  # Print a confirmation that the random sample successfully looped
  print(sprintf('Done with random sample #%d!', i))

  # Bind the information on each random sample to an empty datframe, and rename the columns
  dat2 <- rbind(dat2, mean_error)
  names(dat2) <- c("test_run", "test_size", "mean_real_rating", "mean_predicted_rating", "mean_error")

}

## Warning in mean.default(dat1$user_rating): argument is not numeric or
## logical: returning NA

## Warning in mean.default(dat1$predicted_rating): argument is not numeric or
## logical: returning NA

## Warning in mean.default(dat1$error_rate): argument is not numeric or
## logical: returning NA

## [1] "Done with random sample #1!"
## [1] "Done with random sample #2!"
## [1] "Done with random sample #3!"
## [1] "Done with random sample #4!"
## [1] "Done with random sample #5!"
## [1] "Done with random sample #6!"
## [1] "Done with random sample #7!"
## [1] "Done with random sample #8!"
## [1] "Done with random sample #9!"
## [1] "Done with random sample #10!"

B. Results of evaluation

We can now calculate the overall mean error for all random samples, and visualize the results.

dat2 <- filter(dat2, is.na(dat2$mean_error) == FALSE)

mean_error_sample <- mean(dat2$mean_error)
print(sprintf('After taking %d random samples of %d users each, the mean error of our recommender was %f points.', i, test_size, mean_error_sample))

## [1] "After taking 10 random samples of 100 users each, the mean error of our recommender was 0.743424 points."

dat2  %>% 
  kable("html")     %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

test_run	test_size	mean_real_rating	mean_predicted_rating	mean_error
2	100	2.722500	2.752167	0.0296667
3	100	2.722500	2.752167	0.0296667
4	100	2.330784	3.418922	1.0881373
5	100	2.330784	3.418922	1.0881373
6	100	2.378235	3.264044	0.8858088
7	100	2.378235	3.264044	0.8858088
8	100	2.139271	3.074288	0.9350174
9	100	2.139271	3.074288	0.9350174
10	100	1.961508	2.775066	0.8135582

dat_viz <- gather(dat2, "rating_type", "rating", 3:5)

ggplot(data=dat_viz, aes(x=test_run, y=rating, group=rating_type)) +
  geom_line(aes(color=rating_type)) +
  geom_point(aes(color=rating_type)) + 
  labs(title="Real vs. Predicted Song Ratings of 10 Random Samples", x="Random Sample (100 Users Each)", y="Song Rating")

  scale_color_brewer(palette="Paired") + 
    theme_minimal()

## NULL

8 | CREATE SHINY UI

We liked the idea of creating a user-friendly front-end for the MSD recommender and decided to take on the challenge of building one using Shiny.

To accomplish this, we need to integrate a Neo4j desktop graph coded in Cypher using the RNeo4j package with a user-local Shiny app. We found this to be a non-trivial endeavour. We conducted a fairly exhaustive search of knowledge bases and developer communites but encountered few examples of Neo4j-Shiny integrations.

Though we have not yet succeded in orchestrating the integration of Cypher queries with Shiny server code, based on our work and research to date we believe it is feasible; and additionally, that the work we’ve done can be illustrative and instructive to others. Accordingly, we include here a demonstration of the Shiny user experience we created.

To create a better recommender experience for users we applied the principles of scenario design, identifying UI elements that would reduce the friction involved in collecting a new user’s song preferences and returning recommendations. In order to make song recommendations to the user, we need to collect inputs - specifically in this POC, three songs and user ratings for each song on a 1-100 scale (same as the rating scale in our graph).

First, we request the user name via a free-response text input function. We can use this to label a new listener node in the graph.
Next, we request a song using a search bar. A selectize input function and a few arguments prompt the user to type their choices, autofills songs already in the graph based on letters entered by the user, and prevents the user from choosing songs not in the graph.
Next, we provide a minimalist slider so the user can easily rate the song.
We repeat the previous two steps for two more songs. This yields three songs with attendant ratings.
Once the user has made their choices they press a button to submit them to the recommender. This button can also be used to re-run the recommender.

These song and rating inputs are displayed in a simple table for the user. Per our design, these inputs will simultaneously be passed from Shiny through RNeo4j to the graph to create new nodes and listener-song relationships. Based on a similarity calculation the recommender will then return top-N songs to Shiny. These are rendered as a table of songs and artists (for ease of reference) within the UI.

While this implementation collects the requisite user inputs, we have not yet succeeded in wrapping the Cypher recommender code within Shiny in order to output top-N songs based on those inputs. For the purpose of UX demonstration alone, in place of user input-based recommendations we’ve included a top-N song output for a random listener already in the graph.

The demonstration begins below - the app will run automatically. when complete, please exit the app and return to this window.

9 | CONCLUSIONS

Our Conclusions

Research question: How reliable is the proportion of times that a user listened to a song in predicting the strength of their preference for another song?

Evaluation: Our recommender predicted song ratings for users that were usually within 0.1 points of the user’s real song rating.

Conclusion: The proportion of user song listens is an implicit rating that can reliably predict user song preferences.

This was an exciting, collaborative undertaking which stretched both team members outside of our comfort zones. We worked with a large dataset new to us both; used database technology, language, and structure new to us both; and built our first interactive data application.

The research question we set out to address was whether the proportion of times a listeners listens to a given song compared with the total times they listen to all other songs – an implicit measure we treat as ratings – represents a useful signal of preference. While a seemingly simple metric, the performance the recommender exhibited in predicting ratings of a holdout set suggests that it is useful signal.

Despite the difficulties we encountered connecting Neo4j and Shiny, we found both of these useful additions to our toolbox.

Neo4j represents relational data intuitively and makes it easy to define matrices. While we initially set up and read data into recommenderLab as an alternative approach to building a recommender, Neo4j proved a more interesting challenge and versatile tool.
Shiny provides a host of options to reduce friction and improve user interaction with R functionality, and web-based server implementations would allow us to expose our findings to a wider audience.

Other Takeaways and Next Steps

**Data:** The MSD is a hefty dataset and we made tradeoffs where they were not deleterious to graph structure, recommender performance, and app UX.

To get production ready, we would clean up special characters in artist, title, and other fields.

Neo4j: The graph structure and query syntax are elegant, but support for R is limited.

To make this recommender truly stand-alone, we could host it in a web-based Neo4j server.
RNeo4j appears to be deprecated / not updated.
As noted, Neo4j-Shiny integrations appear few and far between. This is new territory – we are adding to general knowledge about how the two connect as we iterate.

Recommender System: Our Neo4j recommender was surprisingly accurate in predicting song ratings.

How does it compare to RecommenderLab?
We could employ other ways of evaluating performance – for exmaple, through precision-recall curves.

Shiny app: As we worked through builds and tests of the recommender app UX we logged improvements to consider for subsequent iterations:

Obviously, successfully connect the pipes between the Neo4j graph and Shiny I/O.
Prevent duplication between the first, second, and third titles users select.
Where there are identically names songs, confirm artist for selected songs via an additional filtered selctize drop down.
If possible, add logic to randomize selectize drop down list or mask it so the users isn’t biased.
Programmatically include clickable hyperlinks direct to search engines based on recommended titles and artists so the user can watch / listen to videos and experience the recommendation.
Add a restart button to wipe results (fairly complex to control state with reactive per multiple commenters).
Prepare the app for web deployment using Shinapps.io, eliminating reliance on RStudio global environment and leveraging web services for the graph.

References

http://shiny.rstudio.com/gallery/ https://shiny.rstudio.com/reference/shiny/1.0.2/textInput.html https://shiny.rstudio.com/reference/shiny/1.0.1/selectInput.html https://shiny.rstudio.com/articles/selectize.html https://github.com/selectize/selectize.js/blob/master/docs/usage.md https://shiny.rstudio.com/reference/shiny/1.0.5/renderTable.html http://shiny.rstudio.com/reference/shiny/1.0.2/observeEvent.html https://deanattali.com/blog/building-shiny-apps-tutorial/ https://rdrr.io/cran/RNeo4j/man/cypher.html https://stackoverflow.com/questions/31123283/pass-shiny-ui-text-input-into-rmongodb-query https://stackoverflow.com/questions/41504111/modularizing-shiny-r-app-code?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa https://nicolewhite.github.io/2014/06/30/create-shiny-app-neo4j-graphene.html https://nicolewhite.github.io/2014/06/30/create-shiny-app-neo4j-graphene.html https://deanattali.com/blog/shiny-persistent-data-storage/

DATA 607, Final Project – Music Recommender with Neo4j

Kavya Beheraj and Jeremy O’Brien

May 13, 2018