Please complete the research discussion assignment in a Jupyter or R Markdown notebook. You should post the GitHub link to your research in a new discussion thread.
For this discussion item, please watch the following talk and summarize what you found to be the most important or interesting points. The first half will cover some of the mathematical techniques covered in this unit’s reading and the second half some of the data management challenges in an industrial-scale recommendation system.
Watch Video Music Recommendations at Scale with Spark - Christopher Johnson (Spotify) Duration: 26:29 User: n/a - Added: 7/17/14 YouTube URL: http://www.youtube.com/watch?v=3LBgiFch4_g
This was an interesting presentation on the Spotify recommendation system, basically showing which type of mathematical method it is used in Spotify and also on how the use of that method led them to switch from Hadoop to Spark.
Presenter made a distinction between two methods: explicit matrix factorization, which was utilized by the Netflix prize winners and implicit matrix factorization, which is utilized by Spotify. On the explicit method, the goal is to predict how users will rate new movies. On the implicit method, no such rating prediction is made but rather music is classified by streamed or never streamed binary factor. On the implicit method, Spotify uses the alternative least square to generate recommendations.
Hadoop was used in the first years of the company but it was found to be inefficient as the number of users and songs grew. Specifically, Hadoop suffered from I/O overhead in that data had to be fetched from physical hard disk drives several times during the recommendation process. To solve this scalability issue, Spotify migrated to Spark. Given the extensive use of in-memory data, instead of access to data stored in hard disk drives, Spotify’s algorithm was more suited for the Spark technology than Hadoop’s.
What Spark allowed them to do was to develop an algorithm named Gridify which accesses data in partitions or blocks instead of line by line broadcasting. In gridify, group ratings are partitioned and cached. Main cons of the “broadcast everything” approach are: unnecessarily shuffling all data across each iteration, not caching ratings data, and unnecessarily sending a full copy of user/item vectors to all workers.
Full gridify method proves to be more efficient than the “broadcast everything” method. It basically divides the group ratings matrix into smaller blocks, partitions and caches them. The so called “full gridify” works like this: for each iteration it computes YtY over time vectors and broadcasts. Then, for each item vector it sends a copy to each rating block, computes intermediate terms for each block and finally it groups by user, aggregates terms and solves for optimal user vector. Pros are: ratings get cached and never shuffled, each partition only requires a subset of item/user vectors in memory, potentially requiring less local memory than a “half gridify” scheme. As for cons: it sends lots of intermediate data over each iteration in order to aggregate and solve for optimal vectors and it requires more IO overhead than a half gridify scheme.
“Half gridify” partitions ratings matrix into user and item blocks, partition and cache. Main difference to the full method is that it doesn’t require calculation of intermediate terms for each block. Main cons is that it may require more local memory than the full method described above. It was interesting to learn that this algorithm is indeed implemented in the MLlib library.
Interesting to the discussion was the run time comparison between Hadoop and Spark’s gridify implementation. Half-gridify was the fastest of the three, taking 1.5hrs on a dataset containing 4 million users and 500k artists. Noteworthy is the fact that at that time in 2014, they were not able to run the then 40m users and 20m library in Spark.
In terms of optimization time, MLlib proved to be the fastest amongst MATLAB, Mahoot, and GraphLab.