Music Recommendations at Scale with Spark

by Christopher Johnson

In this video, the presenter discusses Spotify's music recommendation system.

Mathematical techniques

First, the speaker explains the mathematical techniques: explicit and implicit matrix factorization, and alternating least squares.

Explicit matrix factorization approximates the ratings matrix by the product of low-dimensional user and movie matrices. It minimizes RMSE.
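
To make the objective concrete, here is a minimal Python sketch of the quantity being minimized. The names predict and rmse, and the dictionary layout for ratings and factors, are assumptions made for illustration and do not come from the talk.

```python
import math

def predict(user_vec, item_vec):
    """Predicted rating: dot product of a user vector and an item vector."""
    return sum(x * y for x, y in zip(user_vec, item_vec))

def rmse(ratings, user_factors, item_factors):
    """Root mean squared error over the observed ratings.

    ratings: dict mapping (user, item) -> observed rating
    user_factors / item_factors: dict mapping id -> low-dimensional vector
    """
    squared_errors = [
        (r - predict(user_factors[u], item_factors[i])) ** 2
        for (u, i), r in ratings.items()
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Factorization then amounts to searching for the user and item vectors that make this number as small as possible.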

Implicit matrix factorization uses binary labels instead of explicit ratings. It minimizes weighted RMSE, using a function of total streams as weights.
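
The sketch below illustrates the weighted objective in Python. The notes only say the weight is "a function of total streams"; the form 1 + alpha * streams used here is an assumption borrowed from a common implicit-feedback formulation, and alpha, preference, confidence and weighted_rmse are hypothetical names.

```python
import math

def preference(streams):
    """Binary label: 1 if the user streamed the track at all, else 0."""
    return 1.0 if streams > 0 else 0.0

def confidence(streams, alpha=1.0):
    """Weight for one user-item pair (assumed form: 1 + alpha * streams)."""
    return 1.0 + alpha * streams

def weighted_rmse(stream_counts, user_factors, item_factors, alpha=1.0):
    """Weighted RMSE over ALL user-item pairs, streamed or not.

    stream_counts: list of rows, stream_counts[u][i] = total streams
    user_factors / item_factors: lists of low-dimensional vectors
    """
    weighted_sq = 0.0
    total_weight = 0.0
    for u, row in enumerate(stream_counts):
        for i, streams in enumerate(row):
            pred = sum(x * y for x, y in zip(user_factors[u], item_factors[i]))
            c = confidence(streams, alpha)
            weighted_sq += c * (preference(streams) - pred) ** 2
            total_weight += c
    return math.sqrt(weighted_sq / total_weight)
```

Unlike the explicit case, unstreamed pairs are included with label 0 and a low weight, so heavily streamed tracks dominate the error.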

Alternating least squares (ALS) alternately holds either the user factor matrix or the item factor matrix fixed while solving for the other.
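
The alternation can be sketched in a few lines of Python. This is a deliberately simplified version, not the implementation from the talk: it assumes rank-1 factors (one number per user and per item), treats every entry of the matrix as observed, and uses L2 regularization with a hypothetical parameter lam. With one side fixed, each update is an ordinary least-squares problem with a closed-form solution.

```python
def update_users(R, y, lam):
    """Hold item factors y fixed; solve least squares for every user."""
    denom = sum(v * v for v in y) + lam  # must be nonzero
    return [sum(r * v for r, v in zip(row, y)) / denom for row in R]

def update_items(R, x, lam):
    """Hold user factors x fixed; solve least squares for every item."""
    denom = sum(v * v for v in x) + lam  # must be nonzero
    return [
        sum(R[u][i] * x[u] for u in range(len(R))) / denom
        for i in range(len(R[0]))
    ]

def loss(R, x, y, lam):
    """Squared error plus L2 penalty; never increases across ALS updates."""
    err = sum(
        (R[u][i] - x[u] * y[i]) ** 2
        for u in range(len(x))
        for i in range(len(y))
    )
    return err + lam * (sum(v * v for v in x) + sum(v * v for v in y))

def als_rank1(R, lam=0.1, sweeps=10):
    """Alternate between the two closed-form updates for a fixed number of sweeps."""
    y = [1.0] * len(R[0])  # arbitrary starting item factors
    x = [0.0] * len(R)
    for _ in range(sweeps):
        x = update_users(R, y, lam)
        y = update_items(R, x, lam)
    return x, y
```

For higher ranks the scalar division becomes a small linear solve per user or per item, but the alternating structure stays the same.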

Hadoop vs. Spark

Second, he compares implementing the method on Hadoop vs. Spark.

Hadoop: Hadoop is a poor fit for an iterative algorithm like this because it continuously reads from and writes to disk, which consumes network and I/O bandwidth and delays processing.

Spark: Spark has an advantage over Hadoop in processing time, because it can keep the ratings cached in memory across iterations instead of rereading them from disk.

Spark - Three different attempts

Lastly, he shows three different attempts at running ALS on Spark:

Broadcast everything: data is unnecessarily shuffled around on every iteration, and the ratings data is not cached, so a full copy of the data is needlessly sent to the workers.
Fully gridify: solves the shuffling problem, and the ratings get cached. However, a lot of intermediate data is still sent over the wire on each iteration in order to aggregate and solve for the optimal vectors.
Half gridify: avoids sending a lot of intermediate data for aggregation and avoids additional shuffling, because once the item vectors are joined with the ratings, each partition has enough information to solve for the optimal user vectors.
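
The half gridify idea can be illustrated with a small pure-Python simulation. This is not Spark code and not the presenter's implementation: the partitioning by user id modulo num_blocks, the rank-1 factors, the use of observed ratings only, and every function name are assumptions chosen to keep the sketch short. It shows that once each user block receives only the item vectors it needs, the user vectors can be solved locally with no further aggregation.

```python
from collections import defaultdict

def partition_by_user(ratings, num_blocks):
    """Group (user, item, rating) triples into blocks keyed by user id."""
    blocks = defaultdict(list)
    for user, item, rating in ratings:
        blocks[user % num_blocks].append((user, item, rating))
    return dict(blocks)

def solve_users_in_block(block, local_items, lam):
    """Rank-1 least squares for each user, using only data in this block."""
    num = defaultdict(float)
    den = defaultdict(float)
    for user, item, rating in block:
        y = local_items[item]
        num[user] += rating * y
        den[user] += y * y
    return {u: num[u] / (den[u] + lam) for u in num}

def half_gridify_user_step(ratings, item_factors, num_blocks, lam=0.1):
    """One user update; returns user factors and item vectors sent in total."""
    user_factors = {}
    vectors_sent = 0
    for block in partition_by_user(ratings, num_blocks).values():
        # "Join": ship only the item vectors that this block's ratings mention.
        needed = {item for _, item, _ in block}
        local_items = {i: item_factors[i] for i in needed}
        vectors_sent += len(local_items)
        # Everything needed to solve for these users is now local.
        user_factors.update(solve_users_in_block(block, local_items, lam))
    return user_factors, vectors_sent
```

Comparing vectors_sent with len(item_factors) * num_blocks gives a rough sense of what broadcasting every item vector to every block would have cost instead.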

Random learnings

  1. Spark's PairRDDFunctions provide the join functions

  2. Kryo serialization is faster than Java serialization

  3. Running with larger datasets often results in failed executors, and the job never fully recovers
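
On the Kryo point above, one way to switch serializers, assuming a standard Spark deployment that reads spark-defaults.conf, is to set the spark.serializer property. The notes do not say how it was configured in the talk, so treat this as a generic configuration fragment:

```
spark.serializer    org.apache.spark.serializer.KryoSerializer
```

The same property can also be set programmatically on the job's Spark configuration before the context is created.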