Scaling Up the Recommendation System at Spotify

Reference Video

This is in reference to the YouTube video below: https://www.youtube.com/watch?v=3LBgiFch4_g

Speaker: Christopher Johnson

Background information: Spotify is a music streaming service which, at the time of the lecture, had a huge catalog of over 40 million songs.

How does Spotify know us? Via its recommendation system, of course. Historically, back in the 2000s, Songza kicked off the online music curation scene by using manual curation to create playlists for users: a team of "music experts" or other human curators would put together playlists that they thought sounded good, and users would then listen to those playlists. Manual curation worked alright, but it was based on that specific curator's choices, and therefore couldn't take into account each listener's individual music taste.

There are three main types of recommendation models that Spotify employs:

1. Collaborative Filtering models (i.e. the ones that Last.fm originally used), which analyze both your behavior and others' behaviors.
2. Natural Language Processing (NLP) models, which analyze text.
3. Audio models, which analyze the raw audio tracks themselves.

Collaborative Filtering in Spotify

Spotify uses a collaborative filtering recommendation engine. As the Spark documentation puts it: "Collaborative filtering is commonly used for recommender systems. It is a technique to fill in the missing entries of a user-item association matrix. MLlib (Spark's scalable machine learning library of common learning algorithms and utilities) currently supports model-based collaborative filtering, in which users and products are described by a small set of latent factors that can be used to predict missing entries. MLlib uses the alternating least squares (ALS) algorithm to learn these latent factors."
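
To make this concrete, here is a minimal sketch of model-based collaborative filtering with MLlib's RDD-based ALS API. The file name, rank, iteration count, and regularization value are illustrative choices for the sketch, not Spotify's settings.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object ALSExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "als-example")

    // Parse a "user,item,rating" file into MLlib's Rating objects.
    // (File name and hyper-parameters below are illustrative only.)
    val ratings = sc.textFile("ratings.csv").map { line =>
      val Array(user, item, rating) = line.split(",")
      Rating(user.toInt, item.toInt, rating.toDouble)
    }

    // Learn latent factors: rank = 10 factors, 10 iterations, lambda = 0.01.
    val model = ALS.train(ratings, 10, 10, 0.01)

    // Fill in a missing entry for user 1 and item 42, and recommend 5 items.
    println(model.predict(1, 42))
    model.recommendProducts(1, 5).foreach(println)

    sc.stop()
  }
}
```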

Implicit Matrix Factorization in Spotify

Explicit matrix factorization applies when users explicitly rate a subset of the item catalog (e.g. movies), and the goal is to predict how users will rate new movies. The model is fit by minimizing the RMSE (root mean squared error) between predicted and known ratings.
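
As a sketch in standard notation (not transcribed from the talk's slides): each user u gets a latent vector x_u and each item i a latent vector y_i, and explicit MF minimizes the regularized squared error over the observed ratings r_ui:

```latex
\min_{x_*, y_*} \sum_{(u,i)\,\text{observed}} \left( r_{ui} - x_u^{\top} y_i \right)^2
  + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)
```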

Implicit matrix factorization is different: instead of explicit ratings it uses binary labels (streamed or not streamed), and it minimizes a weighted squared error in which the weights are a function of total streams.
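
In the formulation of Hu, Koren, and Volinsky that this builds on (again a sketch, with alpha as the confidence-scaling parameter), the binary preference p_ui and the confidence weight c_ui are derived from the stream count r_ui, and the sum now runs over all user-item pairs rather than only the observed ones:

```latex
p_{ui} = \begin{cases} 1 & \text{if } r_{ui} > 0 \\ 0 & \text{otherwise} \end{cases},
\qquad c_{ui} = 1 + \alpha \, r_{ui}

\min_{x_*, y_*} \sum_{u,i} c_{ui} \left( p_{ui} - x_u^{\top} y_i \right)^2
  + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)
```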

Implicit matrix factorization is what Spotify uses: factorizing the user-item matrix into low-dimensional latent factor vectors drastically reduces the amount of data that has to be handled. The standard approach to matrix factorization based collaborative filtering treats the entries in the user-item matrix as explicit preferences given by the user to the item. Many real-world use cases, however, only have access to implicit feedback (e.g. views, clicks, purchases, likes, shares, streams). The approach used in MLlib to deal with such data is taken from "Collaborative Filtering for Implicit Feedback Datasets" (Hu, Koren, and Volinsky). Essentially, instead of trying to model the matrix of ratings directly, this approach treats the data as a combination of binary preferences and confidence values.
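
MLlib exposes this variant through ALS.trainImplicit. A minimal sketch, assuming the stream count stands in for the rating field and alpha controls how quickly confidence grows with plays (file name and hyper-parameters are illustrative, not Spotify's):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object ImplicitALSExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "implicit-als-example")

    // Here the "rating" field carries the raw stream count, not an explicit score.
    val streams = sc.textFile("streams.csv").map { line =>
      val Array(user, track, plays) = line.split(",")
      Rating(user.toInt, track.toInt, plays.toDouble)
    }

    // trainImplicit treats the data as binary preferences with confidences
    // c = 1 + alpha * plays: rank = 40, 10 iterations, lambda = 0.01, alpha = 40.
    val model = ALS.trainImplicit(streams, 40, 10, 0.01, 40.0)

    // Recommend 10 tracks for user 1.
    model.recommendProducts(1, 10).foreach(println)

    sc.stop()
  }
}
```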

Scaling Up with Hadoop and Spark

The next step after implicit matrix factorization is scaling it up with Hadoop. In 2014 there were about 700 nodes in Spotify's London data center, but Hadoop suffers terribly from I/O overhead. Spark comes to the rescue: it simplifies the process dramatically and cuts the full job's run time from roughly 10 hours to 1.5 hours. For example:

First attempt: broadcast everything (Hadoop), run time: 10 hours
Second attempt: full gridify (Spark), run time: 3.5 hours
Third attempt: half gridify (Spark), run time: 1.5 hours

Apache Spark can be roughly 30x faster than Hadoop, largely because of the huge speed difference between RAM and disk. Spark is a cluster-computing framework that competes more with MapReduce than with the entire Hadoop ecosystem; for example, Spark doesn't have its own distributed filesystem. Spark keeps working data in memory and spills to disk only when needed, whereas MapReduce is strictly disk-based, so Spark is simpler and faster than MapReduce for the usual machine learning and data analytics applications. In the full gridify scheme, a lot of intermediate data is sent over the wire on each iteration, which means a large I/O overhead compared to the half gridify scheme, although it can require less local memory than half gridify. In the half gridify scheme, the ratings are partitioned, cached, and never shuffled again; once the item vectors are joined with the ratings partitions, each partition has enough information to solve for its optimal user vectors without any additional shuffling.
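
To make the half gridify idea concrete, here is a heavily simplified, conceptual sketch, not Spotify's code: the object name, file name, and hyper-parameters are all illustrative, and the item-vector update step is omitted. Ratings are grouped by user, partitioned, and cached once; each iteration then broadcasts the current item vectors so that every partition can solve its users' regularized least-squares problems locally, with no further shuffle of the ratings.

```scala
import org.apache.spark.{HashPartitioner, SparkContext}
import breeze.linalg.{DenseMatrix, DenseVector}

object HalfGridifySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "half-gridify-sketch")
    val k = 10          // latent factor dimension (illustrative)
    val lambda = 0.01   // regularization (illustrative)
    val alpha = 40.0    // confidence scaling for implicit feedback (illustrative)

    // Ratings grouped by user, partitioned, and cached once; never shuffled again.
    val ratingsByUser = sc.textFile("user_item_streams.tsv")
      .map(_.split("\t"))
      .map(f => (f(0).toInt, (f(1).toInt, f(2).toDouble)))   // (user, (item, plays))
      .groupByKey(new HashPartitioner(sc.defaultParallelism))
      .persist()

    // Current item vectors, randomly initialized and small enough to broadcast.
    val itemVecs: Map[Int, DenseVector[Double]] =
      ratingsByUser.flatMap(_._2.map(_._1)).distinct().collect()
        .map(i => i -> DenseVector.rand(k)).toMap

    for (_ <- 1 to 10) {
      // Ship the item vectors to every partition; each partition then has all the
      // information it needs to solve its users' least-squares problems locally.
      val bItems = sc.broadcast(itemVecs)
      val userVecs = ratingsByUser.mapValues { plays =>
        val Y = bItems.value
        val A = DenseMatrix.eye[Double](k) * lambda
        val b = DenseVector.zeros[Double](k)
        for ((item, count) <- plays; y <- Y.get(item)) {
          val c = 1.0 + alpha * count          // confidence for this (user, item) pair
          A += (y * y.t) * c
          b += y * c
        }
        A \ b                                  // new user vector, no shuffle needed
      }
      // The alternating item-vector update (which does require a shuffle of the much
      // smaller user vectors) is omitted here to keep the sketch short.
      userVecs.count()                         // force evaluation in this sketch
      bItems.unpersist()
    }
    sc.stop()
  }
}
```

The point of this layout is that the large ratings data stays where it was first cached, and only the comparatively small factor vectors move across the network on each iteration.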