“Summarize what you found to be the most important or interesting points.”

The main points I could identify in the video were:

The discussion of why experimentation is so important was really interesting: there are different ways to reach a “good enough” result, so the efficiency of getting there becomes the real question. You have to decide where the economy should come from, for example provisioning enough RAM for the fastest methods, or accepting some extra time and shuffling data “through the cable.”
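
To make that trade-off concrete in modern terms, here is a minimal PySpark sketch contrasting keeping a dataset entirely in RAM with letting it spill to disk. The file path and column name are hypothetical, not from the video.

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("caching-tradeoff").getOrCreate()

df = spark.read.parquet("events.parquet")  # hypothetical dataset

# Fast but RAM-hungry: keep every partition in memory; partitions that
# don't fit are simply recomputed when needed.
df.persist(StorageLevel.MEMORY_ONLY)
# Cheaper alternative: StorageLevel.MEMORY_AND_DISK lets partitions that
# don't fit in RAM spill to local disk, trading speed for hardware cost.

df.groupBy("user_id").count().show()
```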

Not everything is about how fast a process runs; the economics of efficiency are driven by the use case and by when the information is needed. Not everything must be answered in real time, or perhaps even within the same day, so planning which processes require which kind of answer will guide the implementation and the use of expensive resources.
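
One way to picture this in Spark terms: the same aggregation can run as a cheap nightly batch job or as an always-on streaming query, and the choice is purely about when the answer is needed. The paths and column names below are illustrative assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("latency-choice").getOrCreate()

# Batch: run once a day from a scheduler; cheap, answer ready by morning.
daily = spark.read.parquet("events.parquet")  # hypothetical path
daily.groupBy("user_id").count() \
    .write.mode("overwrite").parquet("daily_counts")

# Streaming: the same aggregation kept continuously up to date; this keeps
# a cluster running, so reserve it for answers needed in real time.
stream = spark.readStream.schema(daily.schema).parquet("incoming/")
query = (
    stream.groupBy("user_id").count()
    .writeStream.outputMode("complete")
    .format("memory").queryName("live_counts")
    .start()
)
```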

Nowadays, with cloud services, you are no longer limited by the amount or “size” of hardware you can run. However, the fact that it is available puts you in the dangerous position of trying to answer everything the same way and failing to spend your operating budget efficiently.

Another important thought is that the video, being from 2004, is quite old and showed the limitations of the libraries, languages, and data structures available at the time. Databricks, the company that drives most of the OSS community development for Spark, has made strides on all these fronts: while still giving you the flexibility to manage how data is distributed and shuffled across processors, the automation and efficiency of the libraries are impressive and much easier to use without all the caveats explained in the video. SparkR is a nice implementation of this, and there are several options in Python. Adding other services such as the streaming platform Kafka is, from my perspective, really beginning to leave Hadoop behind, but in reality the choice is still largely a personal one.
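
As a hedged sketch of that flexibility in current Spark versions: you can set the shuffle parallelism and repartition by key yourself, or let adaptive query execution tune it automatically. The dataset, column name, and partition count here are illustrative assumptions, not a prescribed configuration.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shuffle-control")
    # Explicit shuffle parallelism, if you want to manage it yourself...
    .config("spark.sql.shuffle.partitions", "64")
    # ...or let adaptive query execution coalesce partitions automatically.
    .config("spark.sql.adaptive.enabled", "true")
    .getOrCreate()
)

df = spark.read.parquet("events.parquet")  # hypothetical dataset

# Manually redistribute by a key so a later aggregation on that key
# doesn't need a second shuffle.
by_user = df.repartition(64, "user_id")
by_user.groupBy("user_id").count().show()
```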