Some systems work efficiently only with small datasets. For a company like Spotify, whose catalog consists of 40 million songs, a recommender system that scales is crucial. Christopher Johnson explains the problems Spotify faced when developing its recommender system, and why caching and reducing the shuffling of data during matrix factorization are crucial when working with big datasets. Once the ratings matrix is loaded into memory, there is no need to reread it from disk on every iteration: load the ratings into memory, cache them, join against them where they are cached, and keep performing the iterations. The data is split into key-value pairs, and the PairRDDFunctions help to work on those pairs on individual nodes.
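
The caching idea can be sketched on a single machine in plain Python. This is not the code from the talk: it assumes explicit ratings and ordinary regularized alternating least squares, and the names `solve`, `update`, `als`, and `objective` are made up for illustration. The point it tries to show is that the ratings are grouped by key once, kept in memory, and reused by every iteration, which is roughly the role that caching a key-value RDD plays in Spark.

```python
import random
from collections import defaultdict


def solve(a, b):
    """Solve the small dense system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest entry of this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def update(grouped, fixed, rank, lam):
    """Recompute one side's factors while the other side is held fixed."""
    out = {}
    for key, pairs in grouped.items():
        # Normal equations of a ridge regression: (lam*I + sum v v^T) x = sum r v
        a = [[lam if i == j else 0.0 for j in range(rank)] for i in range(rank)]
        b = [0.0] * rank
        for other, r in pairs:
            v = fixed[other]
            for i in range(rank):
                b[i] += r * v[i]
                for j in range(rank):
                    a[i][j] += v[i] * v[j]
        out[key] = solve(a, b)
    return out


def als(ratings, rank=2, lam=0.1, iterations=10, seed=0):
    """Factorize (user, item, rating) triples into user and item vectors."""
    rng = random.Random(seed)
    # Group the ratings by key once and keep both groupings in memory.
    # Every iteration below reuses them instead of rereading the raw data.
    by_user, by_item = defaultdict(list), defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    items = {i: [rng.random() for _ in range(rank)] for i in by_item}
    users = {}
    for _ in range(iterations):
        users = update(by_user, items, rank, lam)  # items held fixed
        items = update(by_item, users, rank, lam)  # users held fixed
    return users, items


def objective(ratings, users, items, lam):
    """Squared error plus L2 penalty: the quantity each update step minimizes."""
    err = sum((r - sum(x * y for x, y in zip(users[u], items[i]))) ** 2
              for u, i, r in ratings)
    reg = sum(x * x for vec in list(users.values()) + list(items.values())
              for x in vec)
    return err + lam * reg
```

Calling `als` on a list of `(user, item, rating)` triples returns two dictionaries of factor vectors, and `objective` can be used to check that more iterations do not make the fit worse. In a distributed setting the two groupings would be partitioned across nodes, and the cost the talk focuses on is moving the factor vectors to the partitions that hold the cached ratings.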

The talk mostly covers the trials and the improvements made along the way, which are its most important parts. It shows the development journey of a recommender system from inefficient and unscalable to more efficient, more robust, and more scalable. Improvements such as avoiding shuffling, weighed against the costs of doing so, show the tradeoffs made to improve run times when building recommender systems on big data. The code provided during the presentation was very helpful for following how matrix factorization code is written.

Kryo serialization is faster than Java serialization, but it may require you to write and/or register your own serializers.