Assignment Instructions
For this discussion item, please watch the following talk and summarize what you found to be the most important or interesting points. The first half will cover some of the mathematical techniques covered in this unit’s reading and the second half some of the data management challenges in an industrial-scale recommendation system.
Music Recommendations at Scale with Spark - Christopher Johnson (Spotify)
Introduction
Spotify’s music streaming service is an On-Demand music streaming witha catalog of 40+ Million songs
Ways of Recommendations at Spotify:
- Discover (personalized recommendations)
- Radio
- Related Artists
- Now Playing
How to find good recommendations:
- Manual Curation : Not scalable but for smaller catalog it is good as being done by Songza and Beats
- Manually Tag Attributes : Not scalable but music experts manually tag all of the catalog as being by done by Pandora’s music genome project
- Audio Content, Metadata, Text Analysis : Music data is analyzed from music blogs, twitter or music articles as is done by echonest, Spotify
- Collaborative Filtering : Analyse music users are listening, finding relationships and recommending based on that
Netflix’s Explicit Matrix Factorization
- Users explicitly rate a subset of movie catalog
- Goal : Predict how users will rate new movies
- Approximate ratings matrix by the product of low-dimensional user and movie matrices
- Minimize RMSE (root mean squared error)

Spotify’s Implicit Matrix Factorization
- Instead of explicit ratings use binary labels inferred implicitly based on what user’s are already listening to
- 1 = streamed even if once
- 0 = never streamed
- Goal : Minimizing a loss function
- Minimize weighted RMSE (root mean squared error) using a function of total streams as weights
- Derivatives are an obvious tool for minimizing functions, the two most popular derivative-based methods are:
- Alternating Least Squares (ALS)
- Stochastic Gradient Descent (SGD)
Alternating Least Squares (ALS)
Alternate back-and-forth to solve the Least Square Regression
Music Recommendations at Scale with Spark
Implicit Matrix Factorization with Hadoop
Challenge : Hadoop suffers from I/O overhead
Possible Approach : Spark to the rescue
First Attempt
Cons:
- Unnecessary shuffling all data across wire each iteration
- Not caching ratings data
- Unnecessary sending a full copy of user/item vectors to all workers
Second Attempt
Pros:
- Ratings per cached and never shuffled
- Each partition only requires a subset of item (or user) vectors in memory each iteration
- Potentially requires less memory than a “half gridify” scheme
Cons:
- Sending lots of intermediate data over wire each iteration in order to aggregate and solve for optimal vectors
- More I/O overhead than a “half gridify” scheme
Third Attempt
Pros:
- Ratings per cached and never shuffled
- Once item vectors are joined with ratings partitions, each partition has enough information to solve optimal user vectors without any additional shuffling/aggregation (which occurs with the “full gridify” scheme)
Cons:
- Each partition could potentially require a copy of each item vector (which may not all fit in memory)
- Potentially requires more local memory than “full gridify” scheme
ALS Running Times
- Dataset consisting of Spotify streaming data for 4 Million users and 500K artists
- All jobs run using 40 latent factors
- Spark jobs used 200 executors with 20G containers
- Hadoop job used 1K mappers, 300 reducers
| 10 hours |
3.5 hours |
1.5 hours |
Additional Random Learning
- PairRDDFunctions in Spark : Split data into Key-Value pairs and RDD functions can be joined using group by partition keys
- Kyro serialization faster than Java serialization but may require to write and/or register one’s own serializers
- kNN vs Matrix Factorization : kNN approach cannot compute similarity between user and item but can only compute similarity between user-user or item-item to give recommendations because kNN method does not deal really well with sparse matrices, as such matrices would work better with matrix decomposition algorithms.
Conclusion
The entire video was very well explained along-with the underlying technology and algorithms of code flow.
The best part that interested me was the use of Spark for im-memory data processing for more efficient and real-time calculations that are horizontally scalable.
In-memory data processing really helps avoiding I/O and network round-trips and thus gives real-time data analytics avoiding the legacy batch route.