Instruction

Please complete the research discussion assignment in a Jupyter or R Markdown notebook. You should post the GitHub link to your research in a new discussion thread.

For this discussion item, please watch the following talk and summarize what you found to be the most important or interesting points. The first half will cover some of the mathematical techniques covered in this unit’s reading and the second half some of the data management challenges in an industrial-scale recommendation system.

YouTube link: http://www.youtube.com/watch?v=3LBgiFch4_g

Spotify Collaborative Filtering System

Music Recommendations at Scale with Spark - Christopher Johnson (Spotify)

Collaborative Filtering System

Spotify uses collaborative filtering to analyze what users are listening to and to recommend songs. It draws on each user's listening behavior together with that of similar users, using “nearest neighbors” to predict what other users might enjoy.
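
To make the nearest-neighbor idea concrete, here is a minimal sketch in Python. It is only an illustration, not code from the talk: the `plays` matrix is made-up data, and cosine similarity is assumed as the similarity measure.

```python
import numpy as np

# Hypothetical play-count matrix: rows are users, columns are tracks.
plays = np.array([
    [5, 3, 0, 0],
    [4, 0, 0, 1],
    [0, 0, 4, 5],
], dtype=float)


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def nearest_neighbor(matrix, user):
    """Index of the other user whose row is most similar to `user`'s row."""
    sims = [
        cosine_similarity(matrix[user], matrix[other]) if other != user else -np.inf
        for other in range(matrix.shape[0])
    ]
    return int(np.argmax(sims))


def recommend(matrix, user):
    """Tracks the nearest neighbor has played that `user` has not played yet."""
    neighbor = nearest_neighbor(matrix, user)
    unseen = (matrix[user] == 0) & (matrix[neighbor] > 0)
    return [int(i) for i in np.flatnonzero(unseen)]
```

Calling `recommend(plays, 0)` looks up the user most similar to user 0 and suggests the tracks that neighbor has streamed but user 0 has not.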

This is similar to Netflix’s model, but Spotify’s engine is not powered by star ratings. Spotify must use implicit feedback signals like stream counts to infer what users like.

Because Spotify has only implicit data, it uses binary labels (1 = streamed, 0 = never streamed) and minimizes a weighted RMSE, with the weights given by a function of total streams.
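
The weighted error described above can be sketched as follows. This is a rough illustration under stated assumptions: the weight function `1 + alpha * streams` and the default `alpha=40.0` are assumptions borrowed from the common implicit-feedback formulation, not values confirmed by the talk, and `weighted_rmse` is a hypothetical helper name.

```python
import numpy as np


def weighted_rmse(streams, user_factors, item_factors, alpha=40.0):
    """Weighted RMSE between binary preferences and factor-model predictions.

    streams:      (n_users, n_items) array of total stream counts
    user_factors: (n_users, k) array of user vectors
    item_factors: (n_items, k) array of item vectors
    alpha:        assumed scaling of stream counts into weights
    """
    # Binary label: 1 if the user ever streamed the track, else 0.
    preference = (streams > 0).astype(float)
    # Assumed weight: grows linearly with the number of streams.
    confidence = 1.0 + alpha * streams
    predicted = user_factors @ item_factors.T
    squared_error = confidence * (preference - predicted) ** 2
    return float(np.sqrt(squared_error.sum() / confidence.sum()))
```

Entries with many streams get a large weight, so the model is penalized more for mispredicting tracks a user has clearly listened to than for tracks the user has never played.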

Spotify also relies on Spark to organize the ratings matrix and perform the broadcast; once the vectors are joined with the ratings partition, each partition has enough information to solve for the optimal user vector without any additional shuffling.
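
As a sketch of what "solving for the optimal user vector" might look like once a partition holds both the item vectors and one user's ratings, the snippet below uses plain NumPy rather than Spark. It assumes the standard regularized least-squares update for implicit feedback; `solve_user_vector`, `alpha`, and `reg` are illustrative names and values, not taken from the presentation.

```python
import numpy as np


def solve_user_vector(item_factors, user_streams, alpha=40.0, reg=0.1):
    """Least-squares solution for one user's vector with item vectors held fixed.

    item_factors: (n_items, k) array of item vectors available on the partition
    user_streams: (n_items,) array of this user's stream counts
    alpha, reg:   assumed weight scaling and regularization strength
    """
    preference = (user_streams > 0).astype(float)   # binary labels
    confidence = 1.0 + alpha * user_streams         # assumed weight function
    # Scale each item's row by its weight (equivalent to C_u @ Y).
    weighted_items = item_factors * confidence[:, None]
    k = item_factors.shape[1]
    lhs = item_factors.T @ weighted_items + reg * np.eye(k)
    rhs = weighted_items.T @ preference
    # Solve the k x k linear system instead of forming an explicit inverse.
    return np.linalg.solve(lhs, rhs)
```

Because the function only needs the item vectors and a single user's row of ratings, each user can be solved independently, which is why no further shuffling is required once that data sits together.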

According to the presentation, the Spark “half gridify” approach was the most efficient. However, it struggled with very large datasets: it could only run on 20% of the dataset available at Spotify.