Discussion 2

Summary of "Music Recommendations at Scale with Spark - Christopher Johnson (Spotify)

Spotify is a Swedish music streaming and media services provider, and many of their systems revolve around recommender systems to keep users engaged.

A YouTube presentation by Chris Johnson, who works on machine learning at Spotify, discusses how they do recommendation systems at Spotify and some of the challenges they have learned from along the way.
Spotify provides recommendations via many different pathways and a few of these include:
- Discover (personalized recommendations)
- Radio
- Related artists
- Now playing

Because of their enormous music library, one may wonder: how do they sort through such a huge catalog to provide recommendations, and how do they find good ones?
Chris describes in the video that they use audio content (Spotify acquired The Echo Nest to be able to do this), metadata, text analysis, and collaborative filtering (the historic approach). Collaborative filtering involves looking at what users are listening to and finding relationships between users to make recommendations. The idea can be described by Chris's example: Guy 1 likes songs P, Q, R, and S, while Guy 2 likes Q, R, S, and T. Since the two users overlap on most of their songs, we can recommend to each of them the songs the other has listened to that they have not: song T to Guy 1 and song P to Guy 2.
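As a purely illustrative toy (not code from the talk), the set logic behind that example looks like this in Python:

```python
# User-based flavor of the example above: recommend to each user the songs
# the other, similar user has played that they have not.
guy1 = {"P", "Q", "R", "S"}
guy2 = {"Q", "R", "S", "T"}

overlap = guy1 & guy2             # {"Q", "R", "S"} -> the two users look similar
recommend_to_guy1 = guy2 - guy1   # {"T"}
recommend_to_guy2 = guy1 - guy2   # {"P"}
print(overlap, recommend_to_guy1, recommend_to_guy2)
```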

Spotify uses implicit data: rather than explicit ratings, they try to infer what users like from what they listen to. Each user-track pair gets a binary label, 1 for tracks the user has streamed and 0 for everything else, and the loss function weights each term by how often the track was listened to. They then alternate back and forth, solving for the user vectors and then the song vectors, with each step operating like a ridge regression, until the factors converge.
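The talk doesn't show code, but the scheme it describes matches the standard implicit-feedback ALS formulation, so here is a minimal NumPy sketch of it. The function name, the confidence weighting c = 1 + alpha * plays, and all the parameter values are illustrative assumptions, not Spotify's actual settings:

```python
import numpy as np

def implicit_als(plays, factors=10, reg=0.1, alpha=40.0, iters=10):
    """Weighted ALS for implicit feedback: P holds the binary preferences
    (streamed = 1, everything else = 0) and C = 1 + alpha * plays is the
    confidence placed in each preference."""
    n_users, n_items = plays.shape
    P = (plays > 0).astype(float)                 # binary labels
    C = 1.0 + alpha * plays                       # listen-count weighting
    X = 0.01 * np.random.rand(n_users, factors)   # user vectors
    Y = 0.01 * np.random.rand(n_items, factors)   # song vectors
    I = reg * np.eye(factors)

    def solve(fixed, Cmat, Pmat):
        # With one side fixed, each row is a small ridge-regression-style solve.
        out = np.zeros((Cmat.shape[0], factors))
        for u in range(Cmat.shape[0]):
            Cu = np.diag(Cmat[u])
            A = fixed.T @ Cu @ fixed + I
            b = fixed.T @ Cu @ Pmat[u]
            out[u] = np.linalg.solve(A, b)
        return out

    for _ in range(iters):
        X = solve(Y, C, P)        # fix songs, solve users
        Y = solve(X, C.T, P.T)    # fix users, solve songs
    return X, Y

# toy 4 users x 5 songs play-count matrix
plays = np.array([[3, 0, 0, 5, 0],
                  [0, 2, 1, 0, 0],
                  [4, 0, 0, 2, 1],
                  [0, 1, 0, 0, 3]], dtype=float)
X, Y = implicit_als(plays)
print(np.round(X @ Y.T, 2))       # predicted preference for every user/song pair
```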

Scaling up implicit matrix factorization

As of 2014 they had around 700 nodes in their data center. Their first approach was matrix factorization with Hadoop: the ratings matrix is split into blocks, where each block only refers to a subset of the users and songs (e.g. all of the users and ratings assigned to block 0 are placed in the first block), and a bunch of terms are aggregated per block. The issue with Hadoop is that, because this is an iterative algorithm, every iteration has to read from and write back to disk, causing an I/O bottleneck. With Spark, you can load the ratings matrix into memory, cache it, and keep running iterations against it.
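A small PySpark sketch of that cache-and-iterate pattern; the data, app name, and loop body are placeholders rather than anything from Spotify's pipeline:

```python
from pyspark import SparkContext

sc = SparkContext(appName="cache-demo")

# (user, track, play_count) triples; in reality these would come from listening logs
ratings = sc.parallelize([(0, 10, 3), (0, 11, 5), (1, 10, 1), (2, 12, 4)])

# Cache once, then iterate: every pass re-reads the ratings from cluster memory
# instead of re-reading them from disk the way each Hadoop map/reduce pass would.
ratings.cache()

for iteration in range(10):
    # stand-in for one alternating least squares sweep over the cached ratings
    total_plays = ratings.map(lambda r: r[2]).sum()
```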
Their full gridify method sends a copy of each ratings block to a specific worker, caches it, and leaves it there. Each iteration you compute Y transpose Y, figure out which blocks need each item vector, compute all the intermediate terms, shuffle them around, and solve for the optimal user vectors (a rough sketch follows the pros and cons below).
Pros - ratings are cached; each partition only requires a subset of the item vectors; potentially requires less local memory.
Cons - lots of intermediate data is sent over the wire each iteration, so there is more I/O overhead than half gridify.
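A rough PySpark sketch of that full gridify flow, simplified to a single user-vector solve that only sums over observed plays; the data, parameter values, and helper names are assumptions, not the talk's actual implementation:

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="full-gridify-sketch")
factors, reg, alpha = 10, 0.1, 40.0

# (user, item, plays) triples plus the current item vectors, both as keyed RDDs
triples = [(0, 10, 3), (0, 11, 5), (1, 10, 1), (2, 12, 4), (2, 10, 2)]
ratings = sc.parallelize(triples).map(lambda r: (r[1], (r[0], r[2])))  # keyed by item
item_vecs = sc.parallelize([(i, np.random.rand(factors)) for i in (10, 11, 12)])

def partial_terms(joined):
    # joined = (item, ((user, plays), y)); emit this observation's piece of the
    # ridge-style normal equations, keyed by the user who owns it
    item, ((user, plays), y) = joined
    c = 1.0 + alpha * plays
    return user, (c * np.outer(y, y), c * y)

user_vecs = (ratings
             .join(item_vecs)        # each ratings partition only pulls in the
                                     # item vectors it actually refers to
             .map(partial_terms)
             .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
             # ^ the shuffle of intermediate terms over the wire
             .mapValues(lambda ab: np.linalg.solve(ab[0] + reg * np.eye(factors),
                                                   ab[1])))
print(user_vecs.collect())
```

The reduceByKey is exactly the intermediate-data shuffle that the cons above refer to.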

The half gridify method partitions the ratings matrix into K user and item blocks, partitioning and caching them so that all of the ratings for a given user end up in the same block, which cuts down on having to group by user. If you cannot fit all of your item vectors in memory, this method has some constraints (a sketch follows the pros and cons below).
Pros - ratings get cached and never shuffled; once the item vectors are joined with the ratings partitions, each partition has enough information to solve for its optimal user vectors.
Cons - each partition could potentially require a copy of every item vector, so it potentially requires more memory than the full gridify scheme.
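And a matching sketch of the half gridify idea under the same assumptions: ratings are partitioned by user and cached, every partition gets a copy of the item vectors, and the user vectors are solved locally with no shuffle of intermediate terms:

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="half-gridify-sketch")
K, factors, reg, alpha = 2, 10, 0.1, 40.0

triples = [(0, 10, 3), (0, 11, 5), (1, 10, 1), (2, 12, 4), (2, 10, 2)]

# Partition the ratings by user block and cache them; they are never shuffled again.
ratings = (sc.parallelize(triples)
             .keyBy(lambda r: r[0])     # key by user
             .partitionBy(K)
             .cache())

# Every partition gets a full copy of the item vectors (the memory cost noted above).
item_vecs = sc.broadcast({i: np.random.rand(factors) for i in (10, 11, 12)})

def solve_partition(part):
    Y = item_vecs.value
    by_user = {}
    for _, (user, item, plays) in part:
        by_user.setdefault(user, []).append((item, plays))
    for user, obs in by_user.items():
        # All of this user's ratings live in this partition, so the ridge-style
        # normal equations can be solved locally with no further shuffle.
        A, b = reg * np.eye(factors), np.zeros(factors)
        for item, plays in obs:
            c, y = 1.0 + alpha * plays, Y[item]
            A += c * np.outer(y, y)
            b += c * y
        yield user, np.linalg.solve(A, b)

print(ratings.mapPartitions(solve_partition).collectAsMap())
```

Compared with the full gridify sketch, nothing is shuffled after the initial partitionBy, at the price of shipping every item vector to every partition.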

Method Comparison:
With 4 million users and 500k artists, using 40 latent factors; Spark used 200 executors with 20 GB containers, while Hadoop used 1k mappers and 300 reducers. The running times were as follows:
- Hadoop: 10 hours
- Spark (full gridify): 3.5 hours
- Spark (half gridify): 1.5 hours

Spark learnings:
- PairRDDFunctions are great for data split into key-value pairs, allowing things like groupByKey and custom partitioners (see the sketch after this list).
- Kryo serialization is faster than Java serialization, but they had to write and register their own serializers.
- Running larger datasets often resulted in failed executors from which the job never fully recovered, so they spent a lot of time trying to tune things because of this.
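A short PySpark sketch of the first two points (PairRDDFunctions is the Scala class; in Python the same operations simply appear on pair RDDs, and Kryo is enabled through configuration):

```python
from pyspark import SparkConf, SparkContext

# Kryo is turned on through configuration; in Scala/Java you would also register
# the custom serializers the talk mentions (spark.kryo.classesToRegister or a
# KryoRegistrator subclass).
conf = (SparkConf()
        .setAppName("pair-rdd-demo")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))
sc = SparkContext(conf=conf)

# Any RDD of (key, value) pairs exposes the PairRDDFunctions-style operations.
plays = sc.parallelize([("user1", "songA"), ("user1", "songB"), ("user2", "songA")])

# Pick the partitioner explicitly, then group by key; because the RDD is already
# partitioned by key, the groupByKey does not need another shuffle.
by_user = plays.partitionBy(8).groupByKey().mapValues(list)
print(by_user.collect())
```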