Project Proposal Data 612

Prompt:

The goal for your final project is for you to build out a recommender system using a large dataset (ex: 1M+ ratings or 10k+ users, 10k+ items). There are three deliverables, with separate dates. Planning Document: Find an interesting dataset and describe the system you plan to build out. If you would like to use one of the datasets you have already worked with, you should add a unique element or incorporate additional data (i.e. explicit features you scrape from another source, like image analysis on movie posters). The overall goal, however, will be to produce quality recommendations by extracting insights from a large dataset. You may do so using Spark, or another distributed computing method, OR by effectively applying one of the more advanced mathematical techniques we have covered. There is no preference for one over the other, as long as your recommender works! The planning document should be written up and published as a notebook on GitHub or in RPubs. Please submit the link in the Unit 4 folder, due Thursday, July 5.

Introduction:

  • The Beer Advocate dataset provides reviews for a variety of beers over a period of more than 10 years. The dataset includes approximately 1.5 million reviews, each scoring a beer on five “aspects”: appearance, aroma, palate, taste, and overall impression. Each review includes product and user information, the five ratings, and a plain-text review. Source: BeerAdvocate. The purpose of this project is to implement a multi-faceted approach to beer recommendations that combines a user-based collaborative filtering algorithm, similarity matrices, and content-based filtering.

Dataset statistics:

  • Number of reviews: 1,586,259

  • Number of users: 33,387

  • Number of beers: 66,051

  • Users with > 50 reviews: 4,787

  • Median no. of words per review: 126

  • Timespan: Jan 1998 - Nov 2011
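
As a rough illustration of what these figures imply for the user-item matrix, the short Python sketch below computes its approximate density. It uses only the counts listed above. The variable names are illustrative, and the calculation assumes each review is a distinct user-beer pair, so the density shown is an upper bound if some users reviewed the same beer more than once.

```python
# Counts taken from the dataset statistics listed above.
n_reviews = 1_586_259
n_users = 33_387
n_beers = 66_051
n_active_users = 4_787  # users with more than 50 reviews

# Size of the full user-item matrix and the share of cells that hold a rating.
# Assumption: every review is a distinct (user, beer) pair.
n_cells = n_users * n_beers
density = n_reviews / n_cells
sparsity = 1.0 - density

# Share of users who have written more than 50 reviews.
active_share = n_active_users / n_users

print(f"Matrix cells:        {n_cells:,}")
print(f"Density (at most):   {density:.4%}")
print(f"Sparsity (at least): {sparsity:.4%}")
print(f"Users with > 50 reviews: {active_share:.1%} of all users")
```

A matrix this sparse is the main reason the workflow includes imputation and a low-rank model in addition to neighborhood-based filtering.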

Deliverables:

  • A GitHub repository with all data, code, interpretation, and visualization needed to understand and run the model

Projected Workflow:

  1. Data/EDA
    1. Import/explore
    2. Pre-process/clean data
    3. Reshape the dataset to user-item matrix
    4. Generate summary statistics
    5. Center and scale data
    6. Impute null/missing values
    7. Data visualization
  2. UBCF
    1. Implement a user-based collaborative filter model
    2. Recommender/predict
    3. Evaluate accuracy/performance
  3. SVD
    1. Implement a singular value decomposition model
    2. Recommender/predict
    3. Evaluate accuracy/performance
  4. Model Optimization
    1. Evaluate methods to improve model performance
  5. Finalize Model
    1. Generate “Top N” beers
  6. User-App
    1. Provide a user-facing beer recommender interface
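
The core of the workflow above (user-item matrix, centering, UBCF, SVD, and “Top N”) can be sketched on a toy example. This is a minimal Python sketch under stated assumptions, not the project's implementation: the proposal does not fix a language or library, the 4 x 5 ratings matrix is made up, and the function names (center_rows, ubcf_predict, svd_predict, top_n) are hypothetical. Missing ratings are filled with zero after centering, which is one simple imputation choice among several.

```python
import numpy as np

def center_rows(R):
    """Subtract each user's mean from their observed ratings (NaN = unrated)."""
    means = np.nanmean(R, axis=1, keepdims=True)
    return R - means, means

def ubcf_predict(R, k=2):
    """User-based CF: cosine similarity on centered ratings, top-k neighbours."""
    C, means = center_rows(R)
    Z = np.nan_to_num(C)                      # unrated cells become 0 (the user mean)
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                   # guard against all-zero rows
    U = Z / norms
    S = U @ U.T                               # user-user cosine similarity
    np.fill_diagonal(S, 0.0)                  # a user is not their own neighbour
    # Keep the k largest similarities per user, then drop negative ones.
    drop = np.argsort(-S, axis=1)[:, k:]
    W = S.copy()
    np.put_along_axis(W, drop, 0.0, axis=1)
    W = np.clip(W, 0.0, None)
    rated = (~np.isnan(R)).astype(float)
    num = W @ Z                               # similarity-weighted centered ratings
    den = W @ rated                           # total weight of neighbours who rated
    offset = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return means + offset                     # falls back to the user mean if den == 0

def svd_predict(R, rank=2):
    """Low-rank reconstruction of the centered, zero-imputed matrix via SVD."""
    C, means = center_rows(R)
    Z = np.nan_to_num(C)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return means + (U[:, :rank] * s[:rank]) @ Vt[:rank]

def top_n(pred, R, user, n=2):
    """Indices of the n highest-scoring items the user has not rated yet."""
    scores = np.where(np.isnan(R[user]), pred[user], -np.inf)
    order = np.argsort(-scores)
    return [int(i) for i in order[:n] if np.isfinite(scores[i])]

# Toy user-item matrix (rows = users, columns = beers); values are made up.
R = np.array([
    [5.0, 4.0, np.nan, 1.0, np.nan],
    [4.5, np.nan, 4.0, 1.5, 2.0],
    [1.0, 2.0, 1.5, np.nan, 5.0],
    [np.nan, 1.5, 2.0, 4.5, 4.0],
])

ubcf = ubcf_predict(R, k=2)
svd = svd_predict(R, rank=2)
print("UBCF top-N for user 0:", top_n(ubcf, R, user=0))
print("SVD  top-N for user 0:", top_n(svd, R, user=0))
```

On the full dataset a dense matrix of this shape would not fit comfortably in memory, so a sparse representation or a distributed method such as the Spark option mentioned in the prompt would be needed. The evaluation steps would compare predictions against held-out ratings, for example with RMSE.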