Instruction

For this discussion item, please watch the following talk and summarize what you found to be the most important or interesting points. The first half will cover some of the mathematical techniques covered in this unit’s reading and the second half some of the data management challenges in an industrial-scale recommendation system.

http://www.youtube.com/watch?v=3LBgiFch4_g

Response

The talk, given by Christopher Johnson, discusses how Spotify, an on-demand music streaming service, accounts for scale in its music recommendation system.

Two points I found interesting are:

Point One - Mathematical Techniques Used

There are various approaches to building a recommendation system. These include manual curation, manual tagging, text analysis of audio content and collaborative filtering. Since the first three are time-intensive and become even more so at scale, Spotify has opted to use collaborative filtering. Two approaches to formalising collaborative filtering were described: the Explicit Matrix Factorisation method and the Implicit Factorisation method. The explicit method factorises a matrix of ratings given explicitly by users. Netflix uses this approach. The implicit method, on the other hand, factorises a matrix of binary values that stand in for a user's “ratings”. These values are inferred from user behaviour towards a certain item, e.g. the number of times it was viewed. Below is a diagram which explains the difference between these two matrix factorisation techniques. Spotify uses the Implicit Factorisation method in its recommendation engine.

Explicit Factorisation Vs Implicit Factorisation
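To make the difference between the two inputs concrete, here is a minimal sketch in Python. The toy matrices, the function names and the ALPHA value are my own assumptions for illustration and do not come from the talk. The confidence weighting (1 + alpha * count) is one common formulation for implicit feedback and is included only to show how a raw count can still carry information once the rating itself has become binary.

```python
# Hypothetical toy data: rows are users, columns are items.
# Explicit feedback: ratings given directly by users (0 = not rated).
explicit_ratings = [
    [5, 0, 3],
    [0, 4, 0],
]

# Implicit feedback: raw counts observed from user behaviour,
# e.g. the number of times each item was viewed.
view_counts = [
    [12, 0, 1],
    [0, 3, 0],
]

ALPHA = 40.0  # assumed confidence scaling, not a value from the talk


def to_binary_preferences(counts):
    """Return 1 where the user interacted with the item at least once, else 0."""
    return [[1 if count > 0 else 0 for count in row] for row in counts]


def to_confidence(counts, alpha=ALPHA):
    """Weight each entry by 1 + alpha * count, so repeated views count for more."""
    return [[1.0 + alpha * count for count in row] for row in counts]


preferences = to_binary_preferences(view_counts)
confidence = to_confidence(view_counts)
```

In this sketch the explicit matrix would be factorised as it is, while the implicit approach would factorise `preferences`, with `confidence` indicating how much weight each entry deserves.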

Point Two - Data Management Challenges

To scale up implicit factorisation, Spotify has used both Hadoop and Spark. They used Hadoop first, for up to three years. However, they found that Hadoop had a major limitation: I/O overload, which led to an I/O bottleneck. Spark helps overcome this limitation and improves processing speed. It does so by keeping intermediate results in memory, minimising disk reads and writes and performing disk operations only when essential. The diagram below shows this difference between Hadoop and Spark.

Hadoop Vs Spark

As we can see, MapReduce involves at least 4 disk operations while Spark only involves 2 disk operations.
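The counts above can be reproduced with a small counting model. This is only a simplified sketch under my own assumptions: every stage that does not keep its results in memory performs exactly one disk read and one disk write, and the in-memory case touches disk only for the initial input and the final output. The function name and parameters are hypothetical, and real jobs can behave differently, for example when data does not fit in memory.

```python
def disk_operations(num_stages, intermediate_in_memory):
    """Count disk reads and writes for a chain of processing stages."""
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    if intermediate_in_memory:
        # One read of the input and one write of the final result.
        return 2
    # Every stage reads its input from disk and writes its output back.
    return 2 * num_stages


# A two-stage job, as in the diagram.
mapreduce_ops = disk_operations(2, intermediate_in_memory=False)  # 4
spark_ops = disk_operations(2, intermediate_in_memory=True)  # 2

# The gap widens for longer chains, such as iterative algorithms.
mapreduce_ops_10 = disk_operations(10, intermediate_in_memory=False)  # 20
spark_ops_10 = disk_operations(10, intermediate_in_memory=True)  # 2
```

Under these assumptions the difference grows with the number of chained stages, which helps explain why an iterative job such as matrix factorisation benefits from keeping intermediate results in memory.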

There are also challenges with optimisation. Three different attempts to optimise the overall calculation were described. The first attempt used the broadcast everything approach. This involved a lot of unnecessary shuffling and was also time and space consuming. The second attempt used the full gridify approach. With this approach, the ratings are never shuffled and less memory is required. The third attempt used half gridify, which worked the best of the three.
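The memory difference between these attempts can be illustrated with a small sketch. It rests on my own assumptions rather than on details from the talk: ratings are (user, item, value) triples, users and items are assigned to blocks by a simple modulo, and half gridify is read as splitting along the user dimension only. All names and the toy data are hypothetical.

```python
from collections import defaultdict


def gridify(ratings, num_user_blocks, num_item_blocks):
    """Group (user, item, value) ratings into a grid of blocks."""
    blocks = defaultdict(list)
    for user_id, item_id, value in ratings:
        key = (user_id % num_user_blocks, item_id % num_item_blocks)
        blocks[key].append((user_id, item_id, value))
    return dict(blocks)


def items_needed(block_ratings):
    """Item vectors a block needs in memory to update its users."""
    return {item_id for _, item_id, _ in block_ratings}


# Hypothetical toy ratings: 4 users, 4 items.
ratings = [(0, 0, 1), (0, 1, 1), (1, 2, 1), (2, 3, 1), (3, 1, 1)]
all_items = {item_id for _, item_id, _ in ratings}

# Broadcast everything: every worker holds every item vector.
broadcast_items_per_worker = len(all_items)

# Full gridify: split along both the user and the item dimension.
full_grid = gridify(ratings, num_user_blocks=2, num_item_blocks=2)
full_max_items = max(len(items_needed(b)) for b in full_grid.values())

# Half gridify (as assumed here): split along the user dimension only.
half_grid = gridify(ratings, num_user_blocks=2, num_item_blocks=1)
half_max_items = max(len(items_needed(b)) for b in half_grid.values())
```

In this toy example a worker holds 4 item vectors under broadcast everything, at most 2 under full gridify and at most 3 under half gridify. The sketch only covers memory per block. It does not model how much data each approach sends between workers, which also affects which attempt performs best.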

Reference

ThemeGrill. (2019, June 2). Apache Spark Vs Hadoop MapReduce - EduinPro. EduinPro. https://eduinpro.com/blog/apache-spark-vs-hadoop-mapreduce/

DATA612 Research Discussion Two