1 Assignment Instructions

For this discussion item, please watch the following talk and summarize what you found to be the most important or interesting points. The first half covers some of the mathematical techniques from this unit’s reading, and the second half covers some of the data management challenges in an industrial-scale recommendation system.

Music Recommendations at Scale with Spark - Christopher Johnson (Spotify)


2 Introduction

Spotify is an on-demand music streaming service with a catalog of more than 40 million songs.

3 Ways of Recommending at Spotify

  • Discover (personalized recommendations)
  • Radio
  • Related Artists
  • Now Playing

4 How to Find Good Recommendations

  • Manual Curation: Not scalable, but works well for a smaller catalog, as done by Songza and Beats
  • Manually Tag Attributes: Not scalable; music experts manually tag the entire catalog, as done by Pandora’s Music Genome Project
  • Audio Content, Metadata, Text Analysis: Analyze data about music from music blogs, Twitter, and music articles, as done by The Echo Nest and Spotify
  • Collaborative Filtering: Analyze what music users are listening to, find relationships, and recommend based on those relationships

5 Netflix’s Explicit Matrix Factorization

  • Users explicitly rate a subset of the movie catalog
  • Goal: Predict how users will rate new movies
  • Approximate the ratings matrix by the product of low-dimensional user and movie matrices
  • Minimize RMSE (root mean squared error) over the observed ratings (the objective is sketched just below)
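
As a reference, this is the standard regularized form of the explicit objective (the exact regularization a Netflix-style system uses varies); here r_ui is user u's observed rating of movie i, and x_u, y_i are the low-dimensional user and movie vectors:

\[
\min_{X,Y} \sum_{(u,i)\ \mathrm{observed}} \left( r_{ui} - x_u^\top y_i \right)^2
  + \lambda \Big( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \Big)
\]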

6 Spotify’s Implicit Matrix Factorization

  • Instead of explicit ratings, use binary labels inferred implicitly from what users are already listening to
  • 1 = streamed at least once
  • 0 = never streamed
  • Goal: Minimize a loss function
  • Minimize a weighted RMSE (root mean squared error), using a function of total streams as confidence weights (the objective is sketched after this list)
  • Derivatives are an obvious tool for minimizing functions; the two most popular derivative-based methods are:
  1. Alternating Least Squares (ALS)
  2. Stochastic Gradient Descent (SGD)
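
With binary preferences p_ui (streamed or not) and confidence weights c_ui that grow with the number of streams r_ui, the weighted objective (following the implicit-feedback formulation of Hu, Koren, and Volinsky that the talk builds on; the linear weighting below is one common choice) is:

\[
\min_{X,Y} \sum_{u,i} c_{ui} \left( p_{ui} - x_u^\top y_i \right)^2
  + \lambda \Big( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \Big),
\qquad c_{ui} = 1 + \alpha\, r_{ui}
\]

Note that the sum now runs over all (u, i) pairs rather than just the observed ones, which is part of why the choice of solver matters at scale.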

6.1 Alternating Least Squares (ALS)

Alternate back and forth, solving a least-squares regression at each step (a minimal sketch follows the steps below):

  • Fix songs and solve for users

  • Fix users and solve for songs
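
A minimal dense NumPy sketch of the two alternating solves (illustrative only: the variable names are hypothetical, and a real implementation would exploit sparsity rather than form dense confidence matrices):

    import numpy as np

    def als(R, W, k=10, lam=0.1, iters=10, seed=0):
        """Weighted ALS sketch. R: (users x items) binary preference
        matrix; W: same-shape confidence weights (e.g., 1 + alpha*streams)."""
        rng = np.random.default_rng(seed)
        n_users, n_items = R.shape
        X = rng.normal(scale=0.1, size=(n_users, k))  # user vectors
        Y = rng.normal(scale=0.1, size=(n_items, k))  # item (song) vectors
        for _ in range(iters):
            # Fix songs (Y) and solve a regularized weighted
            # least-squares problem for each user
            for u in range(n_users):
                Cu = np.diag(W[u])
                A = Y.T @ Cu @ Y + lam * np.eye(k)
                X[u] = np.linalg.solve(A, Y.T @ Cu @ R[u])
            # Fix users (X) and solve for each song
            for i in range(n_items):
                Ci = np.diag(W[:, i])
                A = X.T @ Ci @ X + lam * np.eye(k)
                Y[i] = np.linalg.solve(A, X.T @ Ci @ R[:, i])
        return X, Y

    # Toy usage: binary "streamed" labels with play-count-style weights
    R = (np.random.default_rng(1).random((20, 30)) > 0.8).astype(float)
    X, Y = als(R, 1.0 + 40.0 * R, k=5)

Each sub-problem has a closed-form solution, which is what makes ALS attractive to parallelize: every user (or song) can be solved independently.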


7 Music Recommendations at Scale with Spark

7.1 Implicit Matrix Factorization with Hadoop

In the original Hadoop implementation, each ALS iteration runs as a map/reduce job: the ratings matrix is partitioned into blocks, the current user/item vectors are joined with the matching rating blocks in the map phase, and the partial terms are aggregated in the reduce phase to solve for the updated vectors.

7.2 Challenge: Hadoop Suffers from I/O Overhead

Because every iteration is a separate map/reduce job, the vectors are written to and read back from disk (HDFS) on each pass, and that disk I/O dominates the running time.

7.3 Possible Approach: Spark to the Rescue

7.3.1 First Attempt

Cons:

  • Unnecessarily shuffles all ratings data across the wire each iteration
  • Does not cache the ratings data
  • Unnecessarily sends a full copy of the user/item vectors to all workers (a minimal sketch of this anti-pattern follows this list)
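
A hedged PySpark sketch of that anti-pattern (all names and data here are illustrative, not Spotify's code): every pass re-shuffles the ratings and re-broadcasts a full copy of the item vectors.

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext(appName="naive-als-sketch")

    k, lam = 10, 0.1
    ratings = sc.parallelize([(0, 0, 3.0), (0, 2, 1.0), (1, 1, 5.0)])  # (user, item, plays)
    item_vecs = {i: np.random.rand(k) for i in range(3)}

    def solve_user(plays, vecs):
        # Regularized least squares for one user against current item vectors
        Y = np.array([vecs[i] for i, _ in plays])
        c = 1.0 + 40.0 * np.array([p for _, p in plays])   # confidence weights
        A = (Y.T * c) @ Y + lam * np.eye(k)
        return np.linalg.solve(A, (Y.T * c) @ np.ones(len(plays)))

    for _ in range(5):
        item_b = sc.broadcast(item_vecs)           # full copy sent to every worker
        user_vecs = (ratings                       # never cached: recomputed each pass
                     .map(lambda r: (r[0], (r[1], r[2])))
                     .groupByKey()                 # shuffles all ratings every iteration
                     .mapValues(lambda ps: solve_user(list(ps), item_b.value))
                     .collectAsMap())
        # (the item-vector half of the iteration would mirror this with roles swapped)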

7.3.2 Second Attempt

Pros:

  • Ratings partitions are cached and never shuffled
  • Each partition only requires a subset of item (or user) vectors in memory each iteration
  • Potentially requires less memory than a “half gridify” scheme

Cons:

  • Sends lots of intermediate data over the wire each iteration in order to aggregate and solve for the optimal vectors
  • More I/O overhead than a “half gridify” scheme

7.3.3 Third Attempt

Pros:

  • Ratings partitions are cached and never shuffled
  • Once item vectors are joined with ratings partitions, each partition has enough information to solve optimal user vectors without any additional shuffling/aggregation (which occurs with the “full gridify” scheme)

Cons:

  • Each partition could potentially require a copy of each item vector (which may not all fit in memory)
  • Potentially requires more local memory than the “full gridify” scheme (a minimal sketch of the half-gridify idea follows this list)
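
A minimal PySpark sketch of the half-gridify idea (hypothetical names; for brevity it broadcasts all item vectors, whereas the real scheme joins each ratings partition only with the item vectors it needs):

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext(appName="half-gridify-sketch")

    k, lam = 10, 0.1
    # (user, (item, plays)) pairs, partitioned by user and cached exactly once
    ratings = (sc.parallelize([(0, (0, 3.0)), (0, (2, 1.0)), (1, (1, 5.0))])
                 .partitionBy(4)
                 .cache())
    item_b = sc.broadcast({i: np.random.rand(k) for i in range(3)})

    def solve_users(partition):
        # Each partition holds everything needed to solve its users'
        # regularized least-squares problems locally -- no further shuffling
        by_user = {}
        for user, (item, plays) in partition:
            by_user.setdefault(user, []).append((item, plays))
        for user, plays in by_user.items():
            Y = np.array([item_b.value[i] for i, _ in plays])
            c = 1.0 + 40.0 * np.array([p for _, p in plays])  # confidence
            A = (Y.T * c) @ Y + lam * np.eye(k)
            b = (Y.T * c) @ np.ones(len(plays))               # preference = 1
            yield user, np.linalg.solve(A, b)

    user_vecs = dict(ratings.mapPartitions(solve_users).collect())

Because the ratings never move after the initial partitionBy, each iteration pays only for shipping the current vectors, a trade-off the running times below reflect.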

7.3.4 ALS Running Times

  • Dataset consisting of Spotify streaming data for 4 Million users and 500K artists
  • All jobs run using 40 latent factors
  • Spark jobs used 200 executors with 20G containers
  • Hadoop job used 1K mappers, 300 reducers
  Hadoop      Spark (full gridify)    Spark (half gridify)
  10 hours    3.5 hours               1.5 hours

8 Additional Random Learning

  • PairRDDFunctions in Spark: RDDs of key-value pairs gain extra operations such as join, groupByKey, reduceByKey, and partitionBy, and co-partitioned RDDs can be joined without a shuffle (see the sketch after this list)
  • Kryo serialization is faster than Java serialization, but may require writing and/or registering one’s own serializers (also shown below)
  • kNN vs. Matrix Factorization: the kNN approach cannot compute similarity between a user and an item, only user-user or item-item similarity, and it does not deal well with sparse matrices; such matrices work better with matrix factorization algorithms
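
A small hedged sketch of the first two points in PySpark (PairRDDFunctions itself is the Scala API; these are the equivalent pair-RDD operations, and all names and data here are illustrative):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("pair-rdd-sketch")
            # Kryo is faster and more compact than Java serialization, but
            # custom classes may need registering (spark.kryo.classesToRegister)
            .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))
    sc = SparkContext(conf=conf)

    plays = sc.parallelize([("user1", 3), ("user2", 1), ("user1", 7)])
    names = sc.parallelize([("user1", "Alice"), ("user2", "Bob")])

    totals = plays.reduceByKey(lambda a, b: a + b)   # ("user1", 10), ("user2", 1)

    # Co-partitioning both RDDs with the same partitioner lets the join
    # happen without an extra shuffle
    totals_p = totals.partitionBy(4).cache()
    names_p = names.partitionBy(4)
    print(totals_p.join(names_p).collect())          # ("user1", (10, "Alice")), ...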

9 Conclusion

The entire video was very well explained, covering both the underlying technology and the flow of the algorithms and code.

The part that interested me most was the use of Spark for in-memory data processing, which enables more efficient, near-real-time computations that scale horizontally.

In-memory data processing avoids much of the disk I/O and many of the network round-trips, enabling near-real-time analytics instead of the legacy batch route.