Assignment Instructions

Find an interesting dataset and describe the system you plan to build. If you would like to use one of the datasets you have already worked with, you should add a unique element or incorporate additional data (e.g., explicit features you scrape from another source, such as image analysis on movie posters). The overall goal, however, will be to produce quality recommendations by extracting insights from a large dataset. You may do so using Spark or another distributed computing method, or by effectively applying one of the more advanced mathematical techniques we have covered. There is no preference for one over the other, as long as your recommender works!

Dataset

The dataset I will use for this project is the Amazon - Ratings (Beauty Products) dataset, which is available on Kaggle. It is a compact version of the original Amazon dataset and consists of over 2 million Amazon reviews and ratings of beauty-related products. The original Amazon dataset contains 142.8 million reviews spanning May 1996 to July 2014. The version I will use contains four columns: UserId, ProductId, Rating, and Timestamp.

Project Objective

My objective for this project is to build a recommendation system that recommends Amazon beauty products to users based on their previous ratings. Because the dataset for this project is considerably larger than those of previous projects, I will use Spark to load and process the data. To work with Spark from R, I will use the sparklyr package. For the recommendation model, I plan to use the Alternating Least Squares (ALS) matrix factorization algorithm.
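
To make the ALS idea concrete, here is a minimal sketch in plain Python. It is only an illustration of the alternating update, not the Spark implementation I plan to use through sparklyr. It assumes a single latent factor per user and per product, a plain L2 penalty, and a handful of made-up ratings; the names als_rank1 and regularized_loss are hypothetical helpers introduced for this example.

```python
# Made-up (user, product, rating) triples, used for illustration only.
ratings = [
    ("u1", "p1", 5.0), ("u1", "p2", 3.0),
    ("u2", "p1", 4.0), ("u2", "p3", 1.0),
    ("u3", "p2", 2.0), ("u3", "p3", 1.0),
]


def als_rank1(ratings, reg=0.1, iterations=20):
    """Fit one latent factor per user and per product by alternating updates.

    Assumption: rank 1 and a plain L2 penalty, so every update is a scalar
    closed-form least-squares solution. reg must be positive.
    """
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    user_f = {u: 1.0 for u in users}
    item_f = {i: 1.0 for i in items}
    for _ in range(iterations):
        # Hold product factors fixed and solve for each user factor.
        for u in users:
            rated = [(item_f[i], r) for uu, i, r in ratings if uu == u]
            user_f[u] = (sum(v * r for v, r in rated)
                         / (reg + sum(v * v for v, _ in rated)))
        # Hold user factors fixed and solve for each product factor.
        for i in items:
            rated = [(user_f[u], r) for u, ii, r in ratings if ii == i]
            item_f[i] = (sum(w * r for w, r in rated)
                         / (reg + sum(w * w for w, _ in rated)))
    return user_f, item_f


def regularized_loss(ratings, user_f, item_f, reg=0.1):
    """Squared error on observed ratings plus the L2 penalty on all factors."""
    error = sum((r - user_f[u] * item_f[i]) ** 2 for u, i, r in ratings)
    penalty = reg * (sum(x * x for x in user_f.values())
                     + sum(x * x for x in item_f.values()))
    return error + penalty


if __name__ == "__main__":
    user_f, item_f = als_rank1(ratings)
    # Predicted rating for a product that u1 has not rated.
    print("u1 on p3:", user_f["u1"] * item_f["p3"])
    print("loss:", regularized_loss(ratings, user_f, item_f))
```

Each pass holds one side fixed and solves exactly for the other, so the regularized loss cannot increase from one pass to the next. The real model would use more than one factor, which turns each scalar update into a small linear system, but the alternating structure is the same.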