1 Assignment Instructions

For this discussion item, please watch the following talk and summarize what you found to be the most important or interesting points. The first half covers some of the mathematical techniques from this unit’s reading, and the second half covers some of the data management challenges in an industrial-scale recommendation system.

Music Recommendations at Scale with Spark - Christopher Johnson (Spotify)


2 Introduction

Spotify is an on-demand music streaming service with a catalog of more than 40 million songs.

3 Ways of Recommending at Spotify

  • Discover (personalized recommendations)
  • Radio
  • Related Artists
  • Now Playing

4 How to Find Good Recommendations

  • Manual Curation: Not scalable, but works well for a smaller catalog, as done by Songza and Beats
  • Manually Tag Attributes: Not scalable; music experts manually tag the entire catalog, as done by Pandora’s Music Genome Project
  • Audio Content, Metadata, Text Analysis: Analyze data about music from music blogs, Twitter, and music articles, as done by The Echo Nest and Spotify
  • Collaborative Filtering: Analyze what music users are listening to, find relationships, and recommend based on those relationships

5 Netflix’s Explicit Matrix Factorization

  • Users explicitly rate a subset of the movie catalog
  • Goal: Predict how users will rate new movies
  • Approximate the ratings matrix by the product of low-dimensional user and movie matrices
  • Minimize RMSE (root mean squared error) over the observed ratings (the objective is sketched just below)
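
As a reference, this is the standard regularized form of the explicit objective (the exact regularization a Netflix-style system uses varies); here r_ui is user u's observed rating of movie i, and x_u, y_i are the low-dimensional user and movie vectors:

\[
\min_{X,Y} \sum_{(u,i)\ \mathrm{observed}} \left( r_{ui} - x_u^\top y_i \right)^2
  + \lambda \Big( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \Big)
\]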

6 Spotify’s Implicit Matrix Factorization

  • Instead of explicit ratings, use binary labels inferred implicitly from what users are already listening to
  • 1 = streamed at least once
  • 0 = never streamed
  • Goal: Minimize a loss function
  • Minimize a weighted RMSE (root mean squared error), using a function of total streams as confidence weights (the objective is sketched after this list)
  • Derivatives are an obvious tool for minimizing functions; the two most popular derivative-based methods are:
  1. Alternating Least Squares (ALS)
  2. Stochastic Gradient Descent (SGD)
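
With binary preferences p_ui (streamed or not) and confidence weights c_ui that grow with the number of streams r_ui, the weighted objective (following the implicit-feedback formulation of Hu, Koren, and Volinsky that the talk builds on; the linear weighting below is one common choice) is:

\[
\min_{X,Y} \sum_{u,i} c_{ui} \left( p_{ui} - x_u^\top y_i \right)^2
  + \lambda \Big( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \Big),
\qquad c_{ui} = 1 + \alpha\, r_{ui}
\]

Note that the sum now runs over all (u, i) pairs rather than just the observed ones, which is part of why the choice of solver matters at scale.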

6.1 Alternating Least Squares (ALS)

Alternate back and forth, solving a least-squares regression at each step (a minimal sketch follows the steps below):

  • Fix songs and solve for users

  • Fix users and solve for songs
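
A minimal dense NumPy sketch of the two alternating solves (illustrative only: the variable names are hypothetical, and a real implementation would exploit sparsity rather than form dense confidence matrices):

    import numpy as np

    def als(R, W, k=10, lam=0.1, iters=10, seed=0):
        """Weighted ALS sketch. R: (users x items) binary preference
        matrix; W: same-shape confidence weights (e.g., 1 + alpha*streams)."""
        rng = np.random.default_rng(seed)
        n_users, n_items = R.shape
        X = rng.normal(scale=0.1, size=(n_users, k))  # user vectors
        Y = rng.normal(scale=0.1, size=(n_items, k))  # item (song) vectors
        for _ in range(iters):
            # Fix songs (Y) and solve a regularized weighted
            # least-squares problem for each user
            for u in range(n_users):
                Cu = np.diag(W[u])
                A = Y.T @ Cu @ Y + lam * np.eye(k)
                X[u] = np.linalg.solve(A, Y.T @ Cu @ R[u])
            # Fix users (X) and solve for each song
            for i in range(n_items):
                Ci = np.diag(W[:, i])
                A = X.T @ Ci @ X + lam * np.eye(k)
                Y[i] = np.linalg.solve(A, X.T @ Ci @ R[:, i])
        return X, Y

    # Toy usage: binary "streamed" labels with play-count-style weights
    R = (np.random.default_rng(1).random((20, 30)) > 0.8).astype(float)
    X, Y = als(R, 1.0 + 40.0 * R, k=5)

Each sub-problem has a closed-form solution, which is what makes ALS attractive to parallelize: every user (or song) can be solved independently.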


7 Music Recommendations at Scale with Spark

7.1 Implicit Matrix Factorization with Hadoop

In the original Hadoop implementation, each ALS iteration runs as a map/reduce job: the ratings matrix is partitioned into blocks, the current user/item vectors are joined with the matching rating blocks in the map phase, and the partial terms are aggregated in the reduce phase to solve for the updated vectors.

7.2 Challenge: Hadoop Suffers from I/O Overhead

Because every iteration is a separate map/reduce job, the vectors are written to and read back from disk (HDFS) on each pass, and that disk I/O dominates the running time.

7.3 Possible Approach: Spark to the Rescue

7.3.1 First Attempt

Cons:

  • Unnecessarily shuffles all ratings data across the wire each iteration
  • Does not cache the ratings data
  • Unnecessarily sends a full copy of the user/item vectors to all workers (a minimal sketch of this anti-pattern follows this list)
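
A hedged PySpark sketch of that anti-pattern (all names and data here are illustrative, not Spotify's code): every pass re-shuffles the ratings and re-broadcasts a full copy of the item vectors.

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext(appName="naive-als-sketch")

    k, lam = 10, 0.1
    ratings = sc.parallelize([(0, 0, 3.0), (0, 2, 1.0), (1, 1, 5.0)])  # (user, item, plays)
    item_vecs = {i: np.random.rand(k) for i in range(3)}

    def solve_user(plays, vecs):
        # Regularized least squares for one user against current item vectors
        Y = np.array([vecs[i] for i, _ in plays])
        c = 1.0 + 40.0 * np.array([p for _, p in plays])   # confidence weights
        A = (Y.T * c) @ Y + lam * np.eye(k)
        return np.linalg.solve(A, (Y.T * c) @ np.ones(len(plays)))

    for _ in range(5):
        item_b = sc.broadcast(item_vecs)           # full copy sent to every worker
        user_vecs = (ratings                       # never cached: recomputed each pass
                     .map(lambda r: (r[0], (r[1], r[2])))
                     .groupByKey()                 # shuffles all ratings every iteration
                     .mapValues(lambda ps: solve_user(list(ps), item_b.value))
                     .collectAsMap())
        # (the item-vector half of the iteration would mirror this with roles swapped)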

7.3.2 Second Attempt

Pros:

  • Ratings partitions are cached and never shuffled
  • Each partition only requires a subset of item (or user) vectors in memory each iteration
  • Potentially requires less memory than a “half gridify” scheme

Cons:

  • Sends lots of intermediate data over the wire each iteration in order to aggregate and solve for the optimal vectors
  • More I/O overhead than a “half gridify” scheme

7.3.3 Third Attempt

Pros:

  • Ratings partitions are cached and never shuffled
  • Once item vectors are joined with ratings partitions, each partition has enough information to solve optimal user vectors without any additional shuffling/aggregation (which occurs with the “full gridify” scheme)

Cons:

  • Each partition could potentially require a copy of each item vector (which may not all fit in memory)
  • Potentially requires more local memory than the “full gridify” scheme (a minimal sketch of the half-gridify idea follows this list)
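
A minimal PySpark sketch of the half-gridify idea (hypothetical names; for brevity it broadcasts all item vectors, whereas the real scheme joins each ratings partition only with the item vectors it needs):

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext(appName="half-gridify-sketch")

    k, lam = 10, 0.1
    # (user, (item, plays)) pairs, partitioned by user and cached exactly once
    ratings = (sc.parallelize([(0, (0, 3.0)), (0, (2, 1.0)), (1, (1, 5.0))])
                 .partitionBy(4)
                 .cache())
    item_b = sc.broadcast({i: np.random.rand(k) for i in range(3)})

    def solve_users(partition):
        # Each partition holds everything needed to solve its users'
        # regularized least-squares problems locally -- no further shuffling
        by_user = {}
        for user, (item, plays) in partition:
            by_user.setdefault(user, []).append((item, plays))
        for user, plays in by_user.items():
            Y = np.array([item_b.value[i] for i, _ in plays])
            c = 1.0 + 40.0 * np.array([p for _, p in plays])  # confidence
            A = (Y.T * c) @ Y + lam * np.eye(k)
            b = (Y.T * c) @ np.ones(len(plays))               # preference = 1
            yield user, np.linalg.solve(A, b)

    user_vecs = dict(ratings.mapPartitions(solve_users).collect())

Because the ratings never move after the initial partitionBy, each iteration pays only for shipping the current vectors, a trade-off the running times below reflect.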

7.3.4 ALS Running Times

  • Dataset consisting of Spotify streaming data for 4 Million users and 500K artists
  • All jobs run using 40 latent factors
  • Spark jobs used 200 executors with 20G containers
  • Hadoop job used 1K mappers, 300 reducers
  Hadoop      Spark (full gridify)    Spark (half gridify)
  10 hours    3.5 hours               1.5 hours

8 Additional Random Learning

  • PairRDDFunctions in Spark: RDDs of key-value pairs gain extra operations such as join, groupByKey, reduceByKey, and partitionBy, and co-partitioned RDDs can be joined without a shuffle (see the sketch after this list)
  • Kryo serialization is faster than Java serialization, but may require writing and/or registering one’s own serializers (also shown below)
  • kNN vs. Matrix Factorization: the kNN approach cannot compute similarity between a user and an item, only user-user or item-item similarity, and it does not deal well with sparse matrices; such matrices work better with matrix factorization algorithms
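
A small hedged sketch of the first two points in PySpark (PairRDDFunctions itself is the Scala API; these are the equivalent pair-RDD operations, and all names and data here are illustrative):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("pair-rdd-sketch")
            # Kryo is faster and more compact than Java serialization, but
            # custom classes may need registering (spark.kryo.classesToRegister)
            .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))
    sc = SparkContext(conf=conf)

    plays = sc.parallelize([("user1", 3), ("user2", 1), ("user1", 7)])
    names = sc.parallelize([("user1", "Alice"), ("user2", "Bob")])

    totals = plays.reduceByKey(lambda a, b: a + b)   # ("user1", 10), ("user2", 1)

    # Co-partitioning both RDDs with the same partitioner lets the join
    # happen without an extra shuffle
    totals_p = totals.partitionBy(4).cache()
    names_p = names.partitionBy(4)
    print(totals_p.join(names_p).collect())          # ("user1", (10, "Alice")), ...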

9 Conclusion

The entire video was very well explained, covering both the underlying technology and the flow of the algorithms and code.

The part that interested me most was the use of Spark for in-memory data processing, which enables more efficient, near-real-time computations that scale horizontally.

In-memory data processing avoids much of the disk I/O and many of the network round-trips, enabling near-real-time analytics instead of the legacy batch route.