Discussion 2 - Music Recommendations at Scale with Spark

For this discussion item, please watch the following talk and summarize what you found to be the most important or interesting points. The first half will cover some of the mathematical techniques covered in this unit’s reading and the second half some of the data management challenges in an industrial-scale recommendation system.

http://www.youtube.com/watch?v=3LBgiFch4_g

In this talk from around 2014, Christopher Johnson discusses the methods that Spotify uses to generate recommendations at scale. Several recommendation methods are available, and competitors use them as well:

Manual Curation: Songza, Beats
Manual Tag: Pandora
Audio Content, Metadata, Text Analysis: Spotify, Echonest
Collaborative Filtering: Spotify, Last.fm

The majority of the presentation is on collaborative filtering techniques, which analyze what users are listening to and find relationships among those listening patterns to generate recommendations.

Explicit Matrix Factorization: approximate the ratings matrix by the product of two lower-dimensional matrices, a user matrix and a movie matrix, chosen to minimize the root mean squared error (RMSE).
Implicit Matrix Factorization: similar to explicit factorization, but each rating is a binary value (streamed or not streamed), and the goal is to minimize a weighted RMSE in which a function of the total streams of a song serves as the weighting factor.
Alternating Least Squares (ALS): fix one set of latent factors and solve for the other, then swap roles, alternating back and forth between the two.
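To make the implicit factorization and ALS steps above more concrete, here is a minimal pure-Python sketch. It is not Spotify's implementation. It assumes the weighting factor is 1 + alpha * streams, and the names implicit_als, update_factors, alpha, and lam (a regularization strength), along with the toy play counts, are all hypothetical choices made for illustration.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / M[r][r]
    return x

def update_factors(fixed, counts, alpha, lam):
    """Hold `fixed` constant and solve a weighted least-squares problem
    for every row of `counts` (one row per user, or per item)."""
    k = len(fixed[0])
    out = []
    for row in counts:
        A = [[lam if i == j else 0.0 for j in range(k)] for i in range(k)]
        b = [0.0] * k
        for f, r in zip(fixed, row):
            conf = 1.0 + alpha * r          # assumed weighting from stream count
            pref = 1.0 if r > 0 else 0.0    # binary "rating"
            for i in range(k):
                b[i] += conf * pref * f[i]
                for j in range(k):
                    A[i][j] += conf * f[i] * f[j]
        out.append(solve(A, b))
    return out

def weighted_loss(X, Y, counts, alpha, lam):
    """Weighted squared error plus L2 regularization."""
    total = 0.0
    for x, row in zip(X, counts):
        for y, r in zip(Y, row):
            pred = sum(a * b for a, b in zip(x, y))
            pref = 1.0 if r > 0 else 0.0
            total += (1.0 + alpha * r) * (pref - pred) ** 2
    reg = sum(v * v for x in X for v in x) + sum(v * v for y in Y for v in y)
    return total + lam * reg

def implicit_als(counts, k=2, alpha=40.0, lam=0.1, iters=10, seed=0):
    """Alternate between solving for user factors X and item factors Y."""
    rng = random.Random(seed)
    counts_t = [list(col) for col in zip(*counts)]  # items x users
    Y = [[rng.random() for _ in range(k)] for _ in range(len(counts_t))]
    X, losses = [], []
    for _ in range(iters):
        X = update_factors(Y, counts, alpha, lam)    # fix items, solve users
        Y = update_factors(X, counts_t, alpha, lam)  # fix users, solve items
        losses.append(weighted_loss(X, Y, counts, alpha, lam))
    return X, Y, losses

if __name__ == "__main__":
    # Toy play counts: rows are users, columns are songs (made-up data).
    plays = [[5, 3, 0, 0], [2, 4, 0, 0], [0, 0, 6, 1], [0, 0, 2, 7]]
    X, Y, losses = implicit_als(plays)
    print("loss after first and last sweep:", losses[0], losses[-1])
```

Because each half-step solves its least-squares problem exactly while the other side is held fixed, the weighted loss should not increase from one sweep to the next. That property is what makes the alternating scheme attractive, and each user's (or item's) solve is independent of the others, which is why the computation can be distributed.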

With these techniques, Spotify needed a way to generate recommendations quickly. Hadoop was used originally, but it eventually ran into I/O bottlenecks: ratings information was not cached and had to be loaded from disk on multiple nodes multiple times to compute recommendations. This did not scale well, so Spark was explored as an alternative.

Spark proved to be much more beneficial, but it took more than one attempt to optimize the runtime. Rather than reading and writing the entire ratings matrix from disk multiple times, Spark can cache it in memory. This removed the I/O overhead seen with Hadoop, but it took Spotify several attempts to get right. In the first attempt everything was broadcast, which was inefficient. The second (full gridify) and third (half gridify) attempts changed how users and ratings are grouped to reduce run time and unnecessary overhead. The Hadoop method took 10 hours to run, whereas Spark took 3.5 hours with full gridify and 1.5 hours with half gridify.
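As a quick check on what those runtimes mean in relative terms, the short sketch below computes the speedup of each Spark variant over the Hadoop baseline. The figures are the ones quoted above; the function name speedups is just an illustrative helper.

```python
# Runtimes in hours, as quoted in the summary above.
runtimes = {
    "Hadoop": 10.0,
    "Spark full gridify": 3.5,
    "Spark half gridify": 1.5,
}

def speedups(runtimes, baseline="Hadoop"):
    """Return how many times faster each method is than the baseline."""
    base = runtimes[baseline]
    return {name: base / hours for name, hours in runtimes.items()}

if __name__ == "__main__":
    for name, factor in speedups(runtimes).items():
        print(f"{name}: {factor:.2f}x")
```

By this arithmetic, full gridify is roughly 2.9 times faster than the Hadoop run and half gridify roughly 6.7 times faster.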

Dhairav Chhatbar

7/6/2020