For this discussion item, please watch the following talk YouTube URL and summarize what you found to be the most important or interesting points. The first half will cover some of the mathematical techniques covered in this unit’s reading and the second half some of the data management challenges in an industrial-scale recommendation system.
I found the explanation of ways to find good recommendations interesting.
One approach is what Pandora is doing with the Music Genome Project. They have music experts who tag the music in their catalog.
Another approach is looking at what users are listening to, analyzing that, finding relationships, and recommending music based on them.
For example, Netflix has a bunch of movies and a bunch of users, and those users have rated some subset of the movies. The goal is to predict how users will rate movies they have not rated yet, so that the movies likely to be rated highly are the ones recommended.
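To make the prediction goal concrete, here is a minimal sketch of one common way to fill in missing ratings: factoring the ratings matrix into user and item vectors. This is only an illustration in plain Python, not the method from the talk or Netflix's actual system. The function names, the toy ratings, and the parameter values (rank, learning rate, regularization) are all assumptions made up for this example.

```python
import random

def factorize(ratings, n_users, n_items, rank=2, steps=2000,
              lr=0.01, reg=0.02, seed=0):
    """Learn user and item factor vectors from (user, item, rating) triples
    using stochastic gradient descent. All defaults are illustrative."""
    rng = random.Random(seed)
    U = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n_users)]
    V = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(steps):
        for u, i, r in ratings:
            err = r - predict(U, V, u, i)
            for k in range(rank):
                uk, vk = U[u][k], V[i][k]
                # Move both factors toward a smaller error, with a small
                # penalty that keeps the factor values from growing too large.
                U[u][k] += lr * (err * vk - reg * uk)
                V[i][k] += lr * (err * uk - reg * vk)
    return U, V

def predict(U, V, u, i):
    """Predicted rating is the dot product of the user and item vectors."""
    return sum(a * b for a, b in zip(U[u], V[i]))

def rmse(U, V, ratings):
    """Root mean squared error over the known ratings."""
    total = sum((r - predict(U, V, u, i)) ** 2 for u, i, r in ratings)
    return (total / len(ratings)) ** 0.5

# Hypothetical toy data: 3 users, 4 movies, 9 known ratings out of 12.
ratings = [(0, 0, 5), (0, 1, 4), (0, 2, 1),
           (1, 0, 5), (1, 1, 4), (1, 3, 1),
           (2, 0, 1), (2, 2, 5), (2, 3, 4)]

U, V = factorize(ratings, n_users=3, n_items=4)
guess = predict(U, V, 0, 3)  # user 0 never rated movie 3
```

Movies with the highest predicted rating among those a user has not seen would be the ones to recommend.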
1) How Spark helps with the I/O overhead:
By loading the ratings matrix into memory, Spark does not need to reread it from disk on every iteration. It loads the data into memory, caches it, joins it to where the ratings are cached, and keeps performing the iterations.
It splits the data into key-value pairs, and the PairRDDFunctions operations help work on those pairs on individual nodes.
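The key-value idea can be sketched without Spark. The snippet below is a plain Python stand-in, not the Spark API: it hashes each key to a partition (playing the role of a node), reduces the values locally inside each partition, and then collects the results. The helper names `partition_by_key` and `reduce_by_key` and the sample ratings are made up for illustration. In Spark, operations such as `reduceByKey` and `join` on pair RDDs fill this role.

```python
def partition_by_key(pairs, num_partitions):
    """Send every (key, value) pair to the partition chosen by its key's hash,
    so all values for one key end up in the same partition."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

def reduce_by_key(partitions, combine):
    """Combine values per key inside each partition, then collect the results.
    Keys never span partitions, so no cross-partition merge is needed."""
    results = {}
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = combine(local[key], value) if key in local else value
        results.update(local)
    return results

# Hypothetical (user_id, rating) pairs.
pairs = [(0, 5.0), (1, 4.0), (0, 3.0), (2, 1.0), (1, 2.0)]
parts = partition_by_key(pairs, num_partitions=2)
rating_sums = reduce_by_key(parts, lambda a, b: a + b)
```

Because each key lives in exactly one partition, every partition can be processed independently, which is what lets the work stay on individual nodes.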
Kryo serialization is faster than Java serialization, but it may require you to write and/or register your own serializers.
Running with larger datasets often results in failed executors, and the job never fully recovers.