For this discussion item, please watch the following talk and summarize what you found to be the most important or interesting points. The first half will cover some of the mathematical techniques covered in this unit's reading and the second half some of the data management challenges in an industrial-scale recommendation system.

http://www.youtube.com/watch?v=3LBgiFch4_g

In this clip from around 2014, Christopher Johnson discusses the methods that Spotify uses to generate recommendations at scale. There are several recommendation methods that can be leveraged, and some are used by competitors.

The majority of the presentation is on collaborative filtering techniques, which analyze what users are listening to and find relationships in that behavior to generate recommendations.
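To make the idea of "finding relationships" in listening behavior concrete, here is a toy sketch that recommends tracks based on how often they are played by the same users. This is a simple co-occurrence approach chosen for illustration, not necessarily the method described in the talk; the user names, track names, and the functions `co_occurrence` and `recommend` are all made up for this example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical listening histories: user -> set of tracks they played.
listens = {
    "ann": {"track_a", "track_b", "track_c"},
    "bob": {"track_a", "track_b"},
    "cat": {"track_b", "track_c"},
    "dan": {"track_a", "track_b", "track_d"},
}


def co_occurrence(histories):
    """Count how many users listened to each pair of tracks."""
    counts = defaultdict(lambda: defaultdict(int))
    for tracks in histories.values():
        for x, y in combinations(sorted(tracks), 2):
            counts[x][y] += 1
            counts[y][x] += 1
    return counts


def recommend(user, histories, top_n=2):
    """Rank unheard tracks by how often they co-occur with the user's tracks."""
    counts = co_occurrence(histories)
    heard = histories[user]
    scores = defaultdict(int)
    for track in heard:
        for other, n in counts[track].items():
            if other not in heard:
                scores[other] += n
    # Highest score first; ties are broken by track name for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [track for track, _ in ranked[:top_n]]


print(recommend("bob", listens))  # ['track_c', 'track_d']
```

Here bob has only played track_a and track_b, so the tracks that other listeners of those two also played are ranked for him. At Spotify's scale the same idea has to run over a very large user-by-track matrix, which is where the data management issues below come in.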

To apply these techniques, Spotify needed a way to generate recommendations quickly. Hadoop was used originally, but it eventually hit I/O bottlenecks: ratings information was not cached, so it had to be loaded from disk on multiple nodes, multiple times, to compute recommendations. This did not scale well, so Spark was explored as an alternative.

Spark proved to be much more beneficial, but it took Spotify several attempts to optimize the runtime. So as not to read and write the entire ratings matrix from disk multiple times, Spark can cache it in memory, which removed the I/O overhead seen with Hadoop. In the 1st attempt everything was broadcast, which was inefficient. The 2nd (full gridify) and 3rd (half gridify) attempts tweaked how users and ratings are grouped to reduce runtime and minimize unnecessary overhead. The Hadoop method took 10 hours to run, compared with 3.5 hours for Spark full gridify and 1.5 hours for half gridify.
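The difference between re-reading the ratings on every pass and caching them once can be sketched with a small simulation. This is plain Python rather than Hadoop or Spark code; `RatingsStore` and `run_iterations` are hypothetical names, and the "disk" is just a counter, so it only illustrates why an iterative job benefits from keeping its input in memory.

```python
class RatingsStore:
    """Stand-in for ratings stored on disk; counts how often they are loaded."""

    def __init__(self, ratings):
        self._ratings = ratings
        self.disk_reads = 0

    def load(self):
        self.disk_reads += 1
        return list(self._ratings)


def run_iterations(store, iterations, cache):
    """Run a dummy iterative job, optionally keeping the ratings in memory."""
    cached = store.load() if cache else None
    total = 0.0
    for _ in range(iterations):
        # Without caching, every iteration goes back to "disk".
        ratings = cached if cache else store.load()
        # Placeholder for one pass of the real computation.
        total += sum(value for _, _, value in ratings)
    return total


# Hypothetical (user, track, rating) triples.
ratings = [("u1", "t1", 3.0), ("u1", "t2", 1.0), ("u2", "t1", 2.0)]

no_cache = RatingsStore(ratings)
result_no_cache = run_iterations(no_cache, iterations=10, cache=False)

with_cache = RatingsStore(ratings)
result_with_cache = run_iterations(with_cache, iterations=10, cache=True)

print(no_cache.disk_reads, with_cache.disk_reads)  # 10 1
```

Both runs compute the same result, but the uncached run loads the ratings once per iteration. In Spark the analogous step is persisting the ratings dataset in memory, for example with `RDD.cache()`. For scale, the runtimes reported above work out to roughly 10 / 3.5 ≈ 2.9 times faster for full gridify and 10 / 1.5 ≈ 6.7 times faster for half gridify than the Hadoop job.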