by Christopher Johnson
In this video, the presenter discusses Spotify's music recommendation system.
First, the speaker explains the mathematical techniques: explicit and implicit matrix factorization, and alternating least squares.
Explicit matrix factorization approximates the ratings matrix by the product of low-dimensional user and movie matrices. It minimizes RMSE.
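As a rough illustration of the explicit objective, the sketch below computes RMSE for a toy factorization in plain Python. The user and item IDs, the factor values, and the function names `predict` and `rmse` are made up for the example and do not come from the talk.

```python
import math

def predict(user_vec, item_vec):
    """Predicted rating is the dot product of the two latent vectors."""
    return sum(x * y for x, y in zip(user_vec, item_vec))

def rmse(ratings, user_factors, item_factors):
    """Root-mean-square error over the observed (user, item, rating) triples."""
    squared_errors = [
        (r - predict(user_factors[u], item_factors[i])) ** 2
        for u, i, r in ratings
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical toy data: 2 users, 2 items, rank-2 factors.
user_factors = {"u1": [1.0, 2.0], "u2": [0.5, 1.0]}
item_factors = {"i1": [2.0, 1.0], "i2": [1.0, 1.0]}
ratings = [("u1", "i1", 4.0), ("u1", "i2", 3.0), ("u2", "i1", 3.0)]

print(rmse(ratings, user_factors, item_factors))
```

Only observed ratings enter the error here, which is what distinguishes the explicit case from the implicit one described next.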
Implicit matrix factorization uses binary labels (streamed or not streamed) instead of explicit ratings. It minimizes a weighted RMSE, using a function of total streams as the weights.
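A minimal sketch of the implicit objective, under stated assumptions: the weight is taken to be `1 + alpha * streams`, which is one common choice, while the notes only say it is a function of total streams. The sketch sums weighted squared errors and leaves out the square root, normalization, and any regularization. All names and data are hypothetical.

```python
def implicit_loss(streams, user_factors, item_factors, alpha=1.0):
    """Weighted squared error over ALL (user, item) pairs.

    streams maps (user, item) -> total stream count; missing pairs count as 0.
    """
    total = 0.0
    for u, x in user_factors.items():
        for i, y in item_factors.items():
            count = streams.get((u, i), 0)
            label = 1.0 if count > 0 else 0.0   # binary label: streamed or not
            weight = 1.0 + alpha * count        # assumed form of the weight
            pred = sum(a * b for a, b in zip(x, y))
            total += weight * (label - pred) ** 2
    return total

# Hypothetical toy data.
user_factors = {"u1": [1.0, 0.0], "u2": [0.0, 1.0]}
item_factors = {"i1": [1.0, 0.0], "i2": [0.5, 0.5]}
streams = {("u1", "i1"): 10, ("u2", "i2"): 2}

print(implicit_loss(streams, user_factors, item_factors))
```

Unlike the explicit case, every user and item pair contributes to the loss, and pairs with many streams are penalized more heavily when the prediction misses the label.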
Alternating least squares (ALS) alternately holds either the user factor matrix or the item factor matrix fixed while solving for the other.
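The alternation can be sketched in plain Python for the explicit case. This is an illustrative single-machine version, not the speaker's Spark code: the regularization term `lam`, the random initialization, the toy ratings, and the small `solve` helper (standing in for a linear algebra library) are all assumptions made for the example.

```python
import random

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def solve(a, b):
    """Solve the small dense system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[k]] for k, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def update_factors(ratings_by_row, fixed, rank, lam):
    """Solve one regularized least-squares problem per row, holding `fixed` constant."""
    updated = {}
    for row, observed in ratings_by_row.items():
        # Normal equations: (sum of y y^T + lam * I) x = sum of r * y
        a = [[lam if p == q else 0.0 for q in range(rank)] for p in range(rank)]
        b = [0.0] * rank
        for col, r in observed:
            y = fixed[col]
            for p in range(rank):
                b[p] += r * y[p]
                for q in range(rank):
                    a[p][q] += y[p] * y[q]
        updated[row] = solve(a, b)
    return updated

def objective(ratings, users, items, lam):
    """Squared error on observed ratings plus the regularization penalty."""
    err = sum((r - dot(users[u], items[i])) ** 2 for u, i, r in ratings)
    penalty = sum(dot(v, v) for v in users.values()) + sum(dot(v, v) for v in items.values())
    return err + lam * penalty

def als(ratings, rank=2, lam=0.1, iterations=10, seed=0):
    rng = random.Random(seed)
    by_user, by_item = {}, {}
    for u, i, r in ratings:
        by_user.setdefault(u, []).append((i, r))
        by_item.setdefault(i, []).append((u, r))
    items = {i: [rng.random() for _ in range(rank)] for i in by_item}
    users = {}
    for _ in range(iterations):
        users = update_factors(by_user, items, rank, lam)  # item factors held fixed
        items = update_factors(by_item, users, rank, lam)  # user factors held fixed
    return users, items

# Hypothetical toy ratings: (user, item, rating).
ratings = [("u1", "a", 5.0), ("u1", "b", 1.0), ("u2", "a", 4.0),
           ("u2", "c", 2.0), ("u3", "b", 5.0), ("u3", "c", 4.0)]
users, items = als(ratings)
print(objective(ratings, users, items, lam=0.1))
```

With one side fixed, each row's problem is an ordinary least-squares solve that is independent of the other rows, which is what makes the method straightforward to parallelize.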
Second, he compares running the method on Hadoop vs. Spark.
Hadoop: Hadoop is challenging for this workload because it continuously reads from and writes to disk, which consumes network I/O bandwidth and delays processing.
Spark: Spark has an advantage over Hadoop in processing time, since it can keep data cached in memory between iterations instead of going back to disk.
Lastly, he shows three different attempts at running ALS on Spark:
Broadcast everything: data is shuffled around unnecessarily on every iteration, and the ratings data is not cached, so a full copy of the data is sent to the workers unnecessarily.
Fully gridify: this solves the shuffling problem, and the ratings get cached. However, it still sends a lot of intermediate data over the wire on each iteration in order to aggregate and solve for the optimal vectors.
Half gridify: this avoids sending a lot of intermediate data for aggregation, as well as any additional shuffling, because once the item vectors are joined with the ratings, each partition has enough information to solve for the optimal user vectors.
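The half gridify idea can be simulated with plain Python dictionaries. This is a sketch of the data flow only, not Spark code and not the speaker's implementation: the partitioning by `user % num_partitions`, the rank-1 factors, the regularization `lam`, and all function names are assumptions chosen to keep the example short.

```python
def partition_by_user(ratings, num_partitions):
    """Group (user, item, rating) triples so each user's ratings sit in one partition."""
    partitions = [{} for _ in range(num_partitions)]
    for user, item, rating in ratings:
        partitions[user % num_partitions].setdefault(user, []).append((item, rating))
    return partitions

def join_item_vectors(partition, item_vectors):
    """Send a partition only the item vectors its ratings refer to (the join step)."""
    needed = {item for observed in partition.values() for item, _ in observed}
    return {item: item_vectors[item] for item in needed}

def solve_users_locally(partition, local_items, lam):
    """Rank-1 regularized least squares per user, using partition-local data only."""
    result = {}
    for user, observed in partition.items():
        num = sum(r * local_items[i] for i, r in observed)
        den = sum(local_items[i] ** 2 for i, _ in observed) + lam
        result[user] = num / den
    return result

# Hypothetical toy data: integer user IDs, item IDs, and rank-1 item factors.
ratings = [(0, 10, 4.0), (0, 11, 2.0), (1, 10, 5.0), (2, 12, 1.0), (3, 11, 3.0)]
item_vectors = {10: 2.0, 11: 1.0, 12: 0.5}

# The ratings are partitioned once and reused every iteration (the cached part).
partitions = partition_by_user(ratings, num_partitions=2)

user_vectors = {}
for partition in partitions:
    local_items = join_item_vectors(partition, item_vectors)
    # No cross-partition aggregation is needed to finish the solve.
    user_vectors.update(solve_users_locally(partition, local_items, lam=0.1))

print(user_vectors)
```

The trade-off visible even in this toy version is that a partition may need a copy of many item vectors, so the scheme exchanges network traffic for local memory.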
Spark functions: PairRDDFunctions (the join functions).
Kryo serialization is faster than Java serialization.
Running with larger datasets often results in failed executors, and the job never fully recovers.