Spark and Scale

What struck me most about this talk was how much of the speaker's and his team's time was spent building and improving their data infrastructure rather than exploring or modeling the data. This is a side of big data that I hadn't considered and wasn't aware of. I was also unaware of the alternating least squares (ALS) approach to matrix factorization; I had only seen stochastic gradient descent and closed-form SVD before. I am interested in comparing ALS to other methods to see the pros and cons. I have a feeling that ALS will require no NA values, but I will research this to be sure.
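
As a starting point for that comparison, a minimal ALS sketch in Python with NumPy might look like the following. It is an illustration under my own assumptions, not the implementation from the talk: the function name `als` and the parameters `rank`, `reg`, and `iters` are all hypothetical. This variant fits only the observed entries through a boolean mask, which is one common way to sidestep NA values; other formulations may instead need a fully filled-in matrix.

```python
import numpy as np

def als(R, mask, rank=2, reg=0.01, iters=50, seed=0):
    """Factor R ~= U @ V.T using alternating least squares.

    R    : (n_users, n_items) ratings; entries where mask is False are ignored
    mask : boolean array, True where a rating was observed
    reg  : ridge penalty (assumed value, chosen only for illustration)
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_items, rank))
    ridge = reg * np.eye(rank)

    for _ in range(iters):
        # Hold V fixed: each user's factors are a small ridge regression
        # over the items that user actually rated.
        for u in range(n_users):
            Vo = V[mask[u]]
            U[u] = np.linalg.solve(Vo.T @ Vo + ridge, Vo.T @ R[u, mask[u]])
        # Hold U fixed: the same regression, now per item over its raters.
        for i in range(n_items):
            Uo = U[mask[:, i]]
            V[i] = np.linalg.solve(Uo.T @ Uo + ridge, Uo.T @ R[mask[:, i], i])
    return U, V

# Toy example: a rank-1 ratings matrix with two ratings missing (NaN).
R = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
R[0, 2] = np.nan
R[3, 0] = np.nan
mask = ~np.isnan(R)

U, V = als(R, mask)
predictions = U @ V.T  # includes estimates for the two missing cells
```

Each inner step has a closed-form solution, which is the main contrast with stochastic gradient descent: there is no learning rate to tune, and the per-user and per-item solves are independent of one another, so they could in principle be run in parallel.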

I also found it interesting that Pandora, one of the giants in the music recommendation space, still uses manual tagging as one of its main recommendation systems. This suggests that I may be overvaluing a purely algorithmic approach to recommendation if some companies are still using manual methods!

Having never worked with industrial-scale data before, I struggled to understand the MapReduce algorithm and the bottleneck that the speaker mentioned. However, the Spark algorithms were very intuitive and easy to understand, even without a background in Spark or big data. I strongly feel that I could implement them without formal training simply because the algorithms are so simple.