Final Project Planning

Eleanor

Requirements:

The goal of the final project is to build a recommender system on a large dataset (e.g., 1M+ ratings, or 10k+ users and 10k+ items).

Objectives:

The purpose of this project is to produce quality recommendations by extracting insights from a large dataset.

Approach:

  • I will use the Amazon Fine Food Reviews dataset, which I found on Kaggle.

  • I will apply the recommender-system methods covered in the course, namely user-based collaborative filtering (UBCF), item-based collaborative filtering (IBCF), and singular value decomposition (SVD), and compare the models to see which gives the best results.

  • I will then use Spark for distributed processing and compare the performance and accuracy of the recommendations between the centralized system and the distributed system.
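
As a rough illustration of the model comparison described above, the following Python sketch scores UBCF and IBCF predictions by RMSE on held-out ratings. It is not the course's implementation: the toy ratings, the train/test split, and the helper names (cosine, neighbor_prediction, predict_ubcf, predict_ibcf, rmse) are all hypothetical, and SVD is left out because it needs a linear algebra library.

```python
from math import sqrt

# Hypothetical toy data: {user: {item: score}} on the dataset's 1-5 scale.
train = {
    "u1": {"p1": 5, "p2": 4},
    "u2": {"p1": 4, "p2": 5, "p3": 2},
    "u3": {"p2": 2, "p3": 5},
    "u4": {"p1": 5, "p2": 4, "p3": 1},
}
# Held-out (user, item, actual score) triples used only for evaluation.
test = [("u1", "p3", 1), ("u3", "p1", 1)]


def cosine(a, b):
    """Cosine similarity over the keys that two sparse vectors share."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[k] * b[k] for k in common)
    norm_a = sqrt(sum(a[k] ** 2 for k in common))
    norm_b = sqrt(sum(b[k] ** 2 for k in common))
    return dot / (norm_a * norm_b)


def transpose(ratings):
    """Turn {user: {item: score}} into {item: {user: score}}."""
    out = {}
    for user, items in ratings.items():
        for item, score in items.items():
            out.setdefault(item, {})[user] = score
    return out


def neighbor_prediction(vectors, target, key):
    """Similarity-weighted mean of the neighbors' values for `key`."""
    target_vec = vectors.get(target, {})
    num = den = 0.0
    for other, vec in vectors.items():
        if other == target or key not in vec:
            continue
        sim = cosine(target_vec, vec)
        num += sim * vec[key]
        den += sim
    return num / den if den > 0 else None


def predict_ubcf(ratings, user, item):
    """UBCF: neighbors are other users who rated `item`."""
    return neighbor_prediction(ratings, user, item)


def predict_ibcf(ratings, user, item):
    """IBCF: neighbors are other items that `user` rated."""
    return neighbor_prediction(transpose(ratings), item, user)


def rmse(predict, ratings, held_out):
    """Root mean squared error over held-out ratings that can be predicted."""
    errors = []
    for user, item, actual in held_out:
        predicted = predict(ratings, user, item)
        if predicted is not None:
            errors.append((predicted - actual) ** 2)
    return sqrt(sum(errors) / len(errors)) if errors else None


for name, model in [("UBCF", predict_ubcf), ("IBCF", predict_ibcf)]:
    print(name, "RMSE:", rmse(model, train, test))
```

With only a handful of ratings the scores say nothing about the real dataset; the point is that both models are evaluated on the same held-out ratings with the same metric, which is what makes the comparison fair.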

Dataset Description:

  • This dataset consists of reviews of fine foods from Amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Reviews include product and user information, ratings, and a plain text review. It also includes reviews from all other Amazon categories.

Dataset Information:

Source: SNAP

  • Sample Data:
    Id:                     1
    ProductId:              B001E4KFG0
    UserId:                 A3SGXH7AUHU8GW
    ProfileName:            delmartian
    HelpfulnessNumerator:   1
    HelpfulnessDenominator: 1
    Score:                  5
    Time:                   1303862400
    Summary:                Good Quality Dog Food
    Text:                   I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most.

Dataset Field Definition:

Field Name              Description
Id                      Unique identifier for the review
ProductId               Unique identifier for the product
UserId                  Unique identifier for the user
ProfileName             Profile name of the user
HelpfulnessNumerator    Number of users who found the review helpful
HelpfulnessDenominator  Number of users who indicated whether they found the review helpful
Score                   Rating from 1 to 5, with 1 being the worst and 5 being the best
Time                    Timestamp of the review (Unix time, in seconds)
Summary                 Brief summary of the review
Text                    Content of the review
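
To show how these fields might be read into a form that a recommender can use, here is a minimal Python sketch that parses the sample record above. It assumes the data arrive as a comma-separated file with a header row matching the field names; the in-memory SAMPLE_CSV string (with the Text field shortened to its first sentence) and the parse_review helper are hypothetical stand-ins for the real file and loader.

```python
import csv
import io
from datetime import datetime, timezone

# Assumption: the file is comma-separated with this header row.
# The Text value is shortened to the first sentence of the sample record.
SAMPLE_CSV = (
    "Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,"
    "HelpfulnessDenominator,Score,Time,Summary,Text\n"
    "1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,"
    'Good Quality Dog Food,"I have bought several of the Vitality canned '
    'dog food products and have found them all to be of good quality."\n'
)


def parse_review(row):
    """Convert one raw CSV row (all strings) into typed fields."""
    helpful = int(row["HelpfulnessNumerator"])
    voted = int(row["HelpfulnessDenominator"])
    return {
        "user_id": row["UserId"],
        "product_id": row["ProductId"],
        "score": int(row["Score"]),
        # Share of voters who found the review helpful; None if nobody voted.
        "helpful_ratio": helpful / voted if voted else None,
        # Time is treated as Unix seconds and converted to a UTC datetime.
        "reviewed_at": datetime.fromtimestamp(int(row["Time"]), tz=timezone.utc),
    }


reviews = [parse_review(row) for row in csv.DictReader(io.StringIO(SAMPLE_CSV))]
review = reviews[0]
print(review["user_id"], review["product_id"], review["score"])
print(review["helpful_ratio"], review["reviewed_at"].date().isoformat())
```

For the sample record, the Time value 1303862400 converts to 2011-04-27 (UTC). The (UserId, ProductId, Score) triple is the part a collaborative filtering model consumes; the helpfulness ratio and date are optional extras for filtering or weighting reviews.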

Distribution of Ratings:

[Figure: distribution of ratings (Score)]
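
A distribution like this can be tabulated directly from (UserId, ProductId, Score) triples. The Python sketch below is illustrative only: the triples are made-up toy values, the summarize helper is hypothetical, and the size check reads the project requirement as "1M+ ratings, or 10k+ users and 10k+ items", which is an assumption about how that requirement is meant.

```python
from collections import Counter

# Hypothetical toy triples: (user_id, product_id, score).
triples = [
    ("u1", "p1", 5), ("u1", "p2", 4), ("u2", "p1", 5),
    ("u2", "p3", 1), ("u3", "p2", 5), ("u3", "p3", 3),
]


def summarize(rows):
    """Count ratings per score and the number of distinct users and items."""
    counts = Counter(score for _, _, score in rows)
    total = len(rows)
    users = {user for user, _, _ in rows}
    items = {item for _, item, _ in rows}
    return {
        "ratings": total,
        "users": len(users),
        "items": len(items),
        # Share of all ratings at each score from 1 to 5 (0.0 if none).
        "distribution": {
            s: (counts.get(s, 0) / total if total else 0.0) for s in range(1, 6)
        },
        # Assumed reading of the requirement: 1M+ ratings, or 10k+ users
        # together with 10k+ items.
        "meets_size_requirement": total >= 1_000_000
        or (len(users) >= 10_000 and len(items) >= 10_000),
    }


summary = summarize(triples)
for score, share in summary["distribution"].items():
    print(score, f"{share:.0%}")
print(summary["ratings"], summary["users"], summary["items"])
```

Running the same function over the full set of triples would give the per-score shares behind the plot, along with the counts needed to check the dataset against the size requirement.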

Reference: