For my project, I plan to build a music recommender using content-based filtering.

My data set will be the Last.fm data set found at this link. It contains 1,892 users and 17,632 artists.

The artist data contains only a link to each artist's Last.fm profile page. To use a content-based approach, I will need to build out artist features by scraping those pages, using Last.fm's API, or drawing on a different source.
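As a rough sketch of what the content-based step could look like once artist features exist, the example below assumes each artist ends up with a list of descriptive tags and each user has per-artist listen counts. The artist names, tags, counts, and function names are all hypothetical placeholders, not values from the data set. It builds a tag profile for a user and ranks unseen artists by cosine similarity to that profile.

```python
import math
from collections import Counter

# Hypothetical artist -> tags mapping (assumption: real features would come
# from scraping, the Last.fm API, or another source).
ARTIST_TAGS = {
    "artist_a": ["rock", "indie", "alternative"],
    "artist_b": ["rock", "classic rock"],
    "artist_c": ["electronic", "ambient"],
    "artist_d": ["indie", "alternative", "folk"],
}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def user_profile(listens, artist_tags):
    """Sum the tag vectors of listened artists, weighted by listen count."""
    profile = Counter()
    for artist, count in listens.items():
        for tag in artist_tags.get(artist, []):
            profile[tag] += count
    return dict(profile)


def recommend(listens, artist_tags, k=2):
    """Rank artists the user has not listened to by similarity to the profile."""
    profile = user_profile(listens, artist_tags)
    scores = {
        artist: cosine(profile, {t: 1.0 for t in tags})
        for artist, tags in artist_tags.items()
        if artist not in listens
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical user who has only listened to artist_a.
print(recommend({"artist_a": 10}, ARTIST_TAGS))
```

Binary tag vectors keep the sketch simple; weighting tags (for example by TF-IDF) would be a natural refinement if common tags such as "rock" dominate the scores.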

Ideally, I would like to use a handwritten matrix factorization algorithm and deliver the production system in a distributed fashion (most likely on Databricks), but I am hesitant to commit to both.
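To give a sense of the scope of a handwritten version, here is a minimal single-machine sketch of matrix factorization trained with stochastic gradient descent. It is only one of several possible formulations, and the toy triples, hyperparameters, and function names are assumptions made for illustration. Note that matrix factorization learns from user-artist interactions rather than from artist features, so it would sit alongside the content-based features rather than replace them.

```python
import random

# Hypothetical (user, artist, preference) triples; real values might be
# scaled listen counts from the data set.
RATINGS = [
    (0, 0, 5.0), (0, 1, 3.0), (0, 3, 1.0),
    (1, 0, 4.0), (1, 3, 1.0),
    (2, 1, 1.0), (2, 2, 5.0), (2, 3, 4.0),
]


def factorize(ratings, n_users, n_items, n_factors=2,
              lr=0.02, reg=0.02, epochs=500, seed=0):
    """Learn user factors P and item factors Q so that P[u] . Q[i] ~ r."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)]
         for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)]
         for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(P[u][f] * Q[i][f] for f in range(n_factors))
            for f in range(n_factors):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q


def predict(P, Q, u, i):
    """Predicted preference of user u for item i."""
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))


def rmse(ratings, P, Q):
    """Root mean squared error over the observed triples."""
    sq = [(r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings]
    return (sum(sq) / len(sq)) ** 0.5


P, Q = factorize(RATINGS, n_users=3, n_items=4)
print("training RMSE:", rmse(RATINGS, P, Q))
print("user 1, item 1 (unobserved):", predict(P, Q, 1, 1))
```

A pure-Python loop like this is fine for checking the math on a small sample, but it does not distribute. Moving to a cluster would likely mean either reformulating the updates or relying on an existing distributed implementation, which is part of why committing to both goals at once is a risk.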