For this discussion item, please watch the following talk and summarize what you found to be the most important or interesting points. The first half covers some of the mathematical techniques from this unit's reading, and the second half covers some of the data management challenges in an industrial-scale recommendation system.
http://www.youtube.com/watch?v=3LBgiFch4_g
One of the biggest challenges in building a recommender system is the sheer volume of data that must be processed. With over 40 million active users and a catalog of roughly 20 million songs, Spotify has a mammoth-sized bank of listening data to sift through. Though Christopher Johnson gave this talk back in 2014, its findings are even more relevant in today's data-driven world.
Spotify needed a way to generate recommendations for users faster and more efficiently. At the time of the video, their process was to use Hadoop to break the data into chunks, calculate the intermediate terms on each chunk, and then aggregate those terms to solve for the optimal user vectors. The biggest drawback was that each iteration required reading from and writing to disk.
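To make those "intermediate terms" concrete: in the implicit-feedback ALS formulation Johnson covers in the first half of the talk (from the Hu, Koren, and Volinsky paper), each user vector is the solution of a small least-squares system built from the current item vectors, and the per-chunk work amounts to accumulating the pieces of that system. Below is a minimal NumPy sketch of one user-side half-iteration; the sizes, toy play counts, and hyperparameter values are illustrative assumptions, not Spotify's.

```python
import numpy as np

# Minimal sketch of one user-side half-iteration of implicit-feedback ALS.
# All sizes, toy data, and hyperparameters (lam, alpha) are illustrative.

rng = np.random.default_rng(0)
n_users, n_items, rank = 100, 50, 10
lam, alpha = 0.1, 40.0                           # regularization, confidence scaling

R = rng.poisson(0.3, size=(n_users, n_items))    # toy play counts
Y = rng.normal(scale=0.1, size=(n_items, rank))  # item vectors, held fixed for this half-iteration

YtY = Y.T @ Y                                    # shared intermediate term, computed once
X = np.empty((n_users, rank))
for u in range(n_users):
    c = 1.0 + alpha * R[u]                       # confidence weights c_ui
    p = (R[u] > 0).astype(float)                 # binary preferences p_ui
    # Normal equations: (Y^T Y + Y^T (C_u - I) Y + lam*I) x_u = Y^T C_u p_u
    A = YtY + (Y.T * (c - 1.0)) @ Y + lam * np.eye(rank)
    b = (Y.T * c) @ p
    X[u] = np.linalg.solve(A, b)
```

This maps onto the Hadoop flow described above: the per-user pieces of A and b are the intermediate terms computed per chunk and aggregated afterwards, and it is those intermediate results that had to hit disk on every iteration.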
The proposed fix for this I/O bottleneck was to use Spark, which loads the ratings matrix into memory and caches it. This removes the need to read from and write to disk at each iteration, ultimately saving a great deal of computation time.
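As a rough illustration, here is a minimal PySpark sketch using MLlib's RDD-based ALS; the input path, record layout, and all parameter values are assumptions for illustration, not what Spotify actually ran.

```python
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

# Sketch of the Spark approach: cache the ratings RDD in memory once so
# iterative passes reuse it instead of re-reading it from disk.

sc = SparkContext(appName="als-sketch")

ratings = (sc.textFile("hdfs:///path/to/playcounts")     # hypothetical input path
             .map(lambda line: line.split("\t"))         # assumed tab-separated user, item, count
             .map(lambda f: Rating(int(f[0]), int(f[1]), float(f[2])))
             .cache())                                   # keep in memory across iterations

# trainImplicit matches the implicit-feedback variant discussed in the talk
model = ALS.trainImplicit(ratings, rank=40, iterations=10, lambda_=0.1, alpha=40.0)

recs = model.recommendProducts(12345, 5)                 # top 5 items for a hypothetical user id
```

The cache() call captures the core idea from the talk: keep the ratings resident in memory so each iteration's passes over the data never have to go back to disk.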
Johnson goes on to compare the original Hadoop technique with two Spark-based alternatives: "full gridify," which partitions the ratings matrix into a grid of blocks and shuffles per-user partial sums between stages, and "half gridify," which partitions the ratings so that all of a given user's ratings land in a single partition, letting each user vector be solved locally at the cost of more memory per partition. A toy sketch of the half-gridify layout appears below.
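The sketch assumes PySpark and NumPy; the ids, sizes, and hyperparameter values are made up for illustration. Plays are grouped so each user's data is co-located, the current item vectors are broadcast to every partition, and each user vector is solved locally with no second shuffle.

```python
from pyspark import SparkContext
import numpy as np

sc = SparkContext(appName="half-gridify-sketch")
rank, lam, alpha = 5, 0.1, 40.0                   # illustrative hyperparameters

# (user, item, play_count) triples standing in for the real ratings data
plays = sc.parallelize([(0, 0, 3.0), (0, 2, 1.0), (1, 1, 5.0), (1, 2, 2.0)])

# Current item vectors, broadcast once per iteration rather than shuffled
rng = np.random.default_rng(0)
Y = rng.normal(scale=0.1, size=(3, rank))
Y_bc = sc.broadcast(Y)

def solve_user(item_counts):
    # Local least-squares solve for one user, using only broadcast data
    Y = Y_bc.value
    A = Y.T @ Y + lam * np.eye(rank)
    b = np.zeros(rank)
    for item, count in item_counts:
        c = 1.0 + alpha * count                   # confidence weight c_ui
        y = Y[item]
        A += (c - 1.0) * np.outer(y, y)           # Y^T (C_u - I) Y contribution
        b += c * y                                # Y^T C_u p_u, with p_ui = 1
    return np.linalg.solve(A, b)

user_vectors = (plays.map(lambda t: (t[0], (t[1], t[2])))
                     .groupByKey()                # co-locate all of a user's plays
                     .mapValues(solve_user))
print(user_vectors.collect())
```

The trade-off Johnson notes is memory: each partition must hold copies of all the item vectors its users need, which full gridify avoids at the price of extra shuffling.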
As expected, run time decreased drastically: from 10 hours with the original Hadoop methodology, to 3.5 hours with the full-gridify method, and finally to 1.5 hours with the half-gridify method. Given the massive amount of new listening data Spotify likely receives every day, it is imperative that recommendations be computed in as little time as possible.