“Summarize what you found to be the most important or interesting points.”

The main points I could identify in the video were:

The discussion of why experimentation is so important was really interesting: there are different ways to reach a “good enough” result, so the efficiency of getting there becomes the real question. You have to decide where the economy should come from, for example provisioning enough RAM for the fastest methods, or accepting some extra time and shuffling data “through the cable.”
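
To make that trade-off concrete in modern terms, here is a minimal PySpark sketch contrasting keeping a dataset entirely in RAM with letting it spill to disk. The file path and column name are hypothetical, not from the video.

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("caching-tradeoff").getOrCreate()

df = spark.read.parquet("events.parquet")  # hypothetical dataset

# Fast but RAM-hungry: keep every partition in memory; partitions that
# don't fit are simply recomputed when needed.
df.persist(StorageLevel.MEMORY_ONLY)
# Cheaper alternative: StorageLevel.MEMORY_AND_DISK lets partitions that
# don't fit in RAM spill to local disk, trading speed for hardware cost.

df.groupBy("user_id").count().show()
```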

Not everything is about how fast a process runs; the economics of efficiency are driven by the use case and by when the information is needed. Not everything must be answered in real time, or perhaps even within the same day, so planning which processes require which kind of answer will guide the implementation and the use of expensive resources.
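
One way to picture this in Spark terms: the same aggregation can run as a cheap nightly batch job or as an always-on streaming query, and the choice is purely about when the answer is needed. The paths and column names below are illustrative assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("latency-choice").getOrCreate()

# Batch: run once a day from a scheduler; cheap, answer ready by morning.
daily = spark.read.parquet("events.parquet")  # hypothetical path
daily.groupBy("user_id").count() \
    .write.mode("overwrite").parquet("daily_counts")

# Streaming: the same aggregation kept continuously up to date; this keeps
# a cluster running, so reserve it for answers needed in real time.
stream = spark.readStream.schema(daily.schema).parquet("incoming/")
query = (
    stream.groupBy("user_id").count()
    .writeStream.outputMode("complete")
    .format("memory").queryName("live_counts")
    .start()
)
```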

Nowadays, with cloud services, you are no longer limited by the amount or “size” of hardware you can run. However, the fact that it is available puts you in the dangerous position of trying to answer everything the same way and failing to spend your operating budget efficiently.

Another important thought is that the video, being from 2004, is quite old and showed the limitations of the libraries, languages, and data structures available at the time. Databricks, the company that drives most of the OSS community development for Spark, has made strides on all these fronts: while still giving you the flexibility to manage how data is distributed and shuffled across processors, the automation and efficiency of the libraries are impressive and much easier to use without all the caveats explained in the video. SparkR is a nice implementation of this, and there are several options in Python. Adding other services such as the streaming platform Kafka is, from my perspective, really beginning to leave Hadoop behind, but in reality the choice is still largely a personal one.
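
As a hedged sketch of that flexibility in current Spark versions: you can set the shuffle parallelism and repartition by key yourself, or let adaptive query execution tune it automatically. The dataset, column name, and partition count here are illustrative assumptions, not a prescribed configuration.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shuffle-control")
    # Explicit shuffle parallelism, if you want to manage it yourself...
    .config("spark.sql.shuffle.partitions", "64")
    # ...or let adaptive query execution coalesce partitions automatically.
    .config("spark.sql.adaptive.enabled", "true")
    .getOrCreate()
)

df = spark.read.parquet("events.parquet")  # hypothetical dataset

# Manually redistribute by a key so a later aggregation on that key
# doesn't need a second shuffle.
by_user = df.repartition(64, "user_id")
by_user.groupBy("user_id").count().show()
```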