Summary of interesting points
- Spotify has about 40M audio files in its data store, a catalog far too large to curate by hand, so an automated recommender system is required
- ALS - Applied the Alternating Least Squares method
- Scaling with Hadoop - Applied implicit matrix factorization, but ran into I/O overhead problems
- Scaling with Spark - Applied the Gridify and Half Gridify techniques, which ran much faster than the Hadoop version
- Speed: Half Gridify was the fastest of the techniques that delivered results when tested on a huge dataset
- Learnings - Use PairRDDFunctions to group the data by key so that each group is assigned to a node that works on it
- Learnings - Write custom serializers or use Kryo rather than the regular Java serializers
- Learnings - Running on larger datasets resulted in failed executors
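
To make the ALS point concrete, here is a minimal sketch of one alternating update for implicit feedback. It is not the code from the talk. It assumes the common confidence weighting `c = 1 + alpha * plays` with a preference of 1 for any played track, and it uses a single latent factor per user and per item so that each least-squares solve reduces to a division. The names `solve_side`, `als_rank1`, `alpha`, and `reg` are hypothetical.

```python
def solve_side(plays, fixed, n_free, alpha, reg, free_axis):
    """Solve for one side (users or items) while the other side is held fixed.

    plays maps (user, item) -> play count; free_axis is 0 to solve for users
    and 1 to solve for items.
    """
    # Every (user, item) pair has a base confidence of 1, played or not.
    base = sum(v * v for v in fixed)
    num = [0.0] * n_free
    den = [base + reg] * n_free
    for key, count in plays.items():
        free, other = key[free_axis], key[1 - free_axis]
        conf = 1.0 + alpha * count
        num[free] += conf * fixed[other]               # preference is 1 here
        den[free] += (conf - 1.0) * fixed[other] ** 2  # extra weight above base
    return [n / d for n, d in zip(num, den)]


def als_rank1(plays, n_users, n_items, alpha=40.0, reg=0.1, sweeps=10):
    """Alternate between the user and item solves for a fixed number of sweeps."""
    x = [1.0] * n_users  # one scalar factor per user
    y = [1.0] * n_items  # one scalar factor per item
    for _ in range(sweeps):
        x = solve_side(plays, y, n_users, alpha, reg, free_axis=0)
        y = solve_side(plays, x, n_items, alpha, reg, free_axis=1)
    return x, y


# Toy play counts: user 0 played items 0 and 1, user 1 played item 2.
toy_plays = {(0, 0): 5, (0, 1): 3, (1, 2): 4}
user_factors, item_factors = als_rank1(toy_plays, n_users=2, n_items=3)
```

A real system would use vectors of several latent factors and solve a small linear system per user and per item. The alternating structure stays the same.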
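
The grouping idea behind the Spark notes can be sketched in plain Python, without Spark. The notes do not define Half Gridify, so the sketch rests on an assumption: that it partitions the ratings by user into a fixed number of blocks, so that all of one user's ratings land on the same worker. The function `group_by_user_block` and the modulo partitioner are hypothetical stand-ins for what a keyed group-by from PairRDDFunctions would do on a cluster.

```python
from collections import defaultdict


def group_by_user_block(ratings, num_blocks):
    """Group (user, item, count) triples into blocks keyed by user id.

    Integer user ids are assumed, so the block of a user is user % num_blocks.
    """
    blocks = defaultdict(list)
    for user, item, count in ratings:
        blocks[user % num_blocks].append((user, item, count))
    return dict(blocks)


# Toy ratings: (user id, item id, play count).
toy_ratings = [(0, 10, 5), (1, 11, 2), (2, 10, 1), (3, 12, 7), (0, 12, 3)]
user_blocks = group_by_user_block(toy_ratings, num_blocks=2)
```

Because every rating of a given user sits in exactly one block, a worker holding that block can update that user's factors without fetching ratings from other workers.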