Final Project Goal


The goal for the final project is for you to build out a recommender system using a large dataset (e.g., 1M+ ratings, or 10k+ users and 10k+ items). There are three deliverables, with separate dates:
[1] Planning Document. Find an interesting dataset and describe the system you plan to build out. If you would like to use one of the datasets you have already worked with, you should add a unique element or incorporate additional data (e.g., explicit features you scrape from another source, such as image analysis of movie posters). The overall goal, however, is to produce quality recommendations by extracting insights from a large dataset. You may do so using Spark or another distributed computing method, OR by effectively applying one of the more advanced mathematical techniques we have covered. There is no preference for one over the other, as long as the recommender works!
[2] Implementation. In this final project deliverable, you’ll build out the system that you describe in the planning document.

Final Project Dataset

The dataset used for the final project is the goodbooks-10k dataset by zygmuntz. It contains 6 million ratings for ten thousand of the most popular books. It also includes the books each user has marked to read, book metadata (author, year, etc.), and tags/shelves/genres. The dataset meets the requirement of having enough users and items to develop a robust recommender system. It is available on GitHub: https://github.com/zygmuntz/goodbooks-10k

Project Planning


The dataset will be imported from GitHub and transformed into a usable format for our recommendation engines. The project will be done in RStudio, with “recommenderlab” as the primary package for developing and evaluating the models.
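As a rough illustration of the "usable format" step, the sketch below pivots long-format rating rows into a user-item structure and reports how sparse it is. It is written in Python purely for illustration (the project itself relies on recommenderlab in R). The column names `user_id`, `book_id`, and `rating`, the toy rows, and the helper names `build_user_item_matrix` and `sparsity` are all assumptions, not taken from the dataset or from any package.

```python
import csv
import io
from collections import defaultdict

# Toy rows standing in for a ratings file; the column names are assumed.
SAMPLE_CSV = """user_id,book_id,rating
1,10,5
1,20,3
2,10,4
3,30,2
"""


def build_user_item_matrix(csv_text):
    """Return {user_id: {book_id: rating}} from long-format CSV text."""
    matrix = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        user = int(row["user_id"])
        book = int(row["book_id"])
        matrix[user][book] = float(row["rating"])
    return dict(matrix)


def sparsity(matrix):
    """Fraction of user-item cells that carry no rating."""
    items = {item for ratings in matrix.values() for item in ratings}
    filled = sum(len(ratings) for ratings in matrix.values())
    return 1.0 - filled / (len(matrix) * len(items))


user_item = build_user_item_matrix(SAMPLE_CSV)
print(user_item)
print(round(sparsity(user_item), 3))
```

A nested dictionary stores only the observed ratings, which matters for a dataset where most user-book pairs are empty.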

The steps learned throughout this course will be applied to the dataset, including sampling, splitting, and modeling techniques. The recommender models that will be used are IBCF (Pearson, Jaccard, and cosine similarity), UBCF (Pearson, Jaccard, and cosine similarity), SVD (center, z-score, and no normalization), and ALS (center, z-score, and no normalization). A distributed system, Spark, will also be used to fit the models, accessed through the sparklyr package. Because metadata is also included, cold-start recommendations based on metadata may be attempted.
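To make the similarity and normalization options named above concrete, here is a minimal Python sketch of cosine, Pearson, and Jaccard similarity between two sparse rating vectors, plus center and z-score normalization. It is illustrative only and does not reproduce recommenderlab's implementation. The function names, the toy ratings, the choice to compute Pearson over co-rated items only, and the use of the population standard deviation in `z_score` are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity, treating unrated items as zeros."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def pearson(a, b):
    """Pearson correlation over co-rated items only (an assumption)."""
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[i] for i in common) / len(common)
    mean_b = sum(b[i] for i in common) / len(common)
    cov = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    var_a = sum((a[i] - mean_a) ** 2 for i in common)
    var_b = sum((b[i] - mean_b) ** 2 for i in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def jaccard(a, b):
    """Overlap of the rated-item sets; rating values are ignored."""
    union = set(a) | set(b)
    if not union:
        return 0.0
    return len(set(a) & set(b)) / len(union)


def center(ratings):
    """Subtract the user's mean rating from each rating."""
    mean = sum(ratings.values()) / len(ratings)
    return {item: value - mean for item, value in ratings.items()}


def z_score(ratings):
    """Center, then divide by the population standard deviation."""
    centered = center(ratings)
    std = math.sqrt(sum(v * v for v in centered.values()) / len(centered))
    if std == 0:
        return centered
    return {item: value / std for item, value in centered.items()}


# Toy ratings keyed by book id (hypothetical values).
user_a = {1: 5.0, 2: 3.0, 3: 4.0}
user_b = {1: 4.0, 2: 2.0, 4: 5.0}
print(cosine(user_a, user_b), pearson(user_a, user_b), jaccard(user_a, user_b))
```

Jaccard depends only on which books were rated, while Pearson and cosine depend on the rating values, so the three measures can rank the same neighbors quite differently.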

The models will be evaluated using metrics such as RMSE, MAE, MSE, ROC curves, and AUC. Evaluation will be done iteratively to assess accuracy across a range of model tuning settings. Each model’s training and prediction time will also be measured to inform the comparison.
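The three error metrics listed above can be sketched as follows. This is a small Python illustration of the standard definitions, not the evaluation code the project will use; the function names and the toy actual and predicted ratings are assumptions.

```python
import math


def mse(actual, predicted):
    """Mean squared error between paired ratings."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error: the square root of MSE."""
    return math.sqrt(mse(actual, predicted))


def mae(actual, predicted):
    """Mean absolute error between paired ratings."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Toy held-out ratings and model predictions (hypothetical values).
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]

print(mse(actual, predicted), rmse(actual, predicted), mae(actual, predicted))
```

RMSE and MSE penalize large errors more heavily than MAE does, which is why the metrics can disagree on which model is best.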