Objective

Find an interesting dataset and describe the system you plan to build out. If you would like to use one of the datasets you have already worked with, you should add a unique element or incorporate additional data (e.g. explicit features you scrape from another source, like image analysis on movie posters). The overall goal, however, will be to produce quality recommendations by extracting insights from a large dataset. You may do so using Spark, or another distributed computing method, OR by effectively applying one of the more advanced mathematical techniques we have covered. There is no preference for one over the other, as long as your recommender works! The planning document should be written up and published as a notebook on GitHub or in RPubs. Please submit the link in the Unit 4 folder, due Thursday, July 5.

Overview of data

Our group project uses three datasets from GroupLens Research, which collected movie rating data through the MovieLens web site (http://movielens.org). The datasets are described as follows:

a) This dataset (ml-latest) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 20,000,263 ratings and 465,564 tag applications across 27,278 movies. These data were created by 138,493 users between January 09, 1995 and March 31, 2015. This dataset was generated on October 17, 2016.

b) This dataset (ml-1m) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 1,000,209 anonymous ratings of approximately 3,900 movies made by 6,040 MovieLens users who joined MovieLens in 2000.

c) This dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 100,836 ratings and 3,683 tag applications across 9,742 movies. These data were created by 610 users between March 29, 1996 and September 24, 2018. This dataset was generated on September 26, 2018.

In all datasets, users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.
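
To get a feel for how sparse these rating matrices are, the counts quoted above can be turned into a density figure (observed ratings divided by users times movies). The snippet below is only a rough sketch in Python: the dictionary name `DATASETS` and the helper `density` are illustrative, and the ml-1m movie count is the approximate "3,900" from the description, so that density is a ballpark value.

```python
# Counts copied from the dataset descriptions above. The ml-1m movie count is
# approximate ("approximately 3,900"), so its density is only a rough figure.
DATASETS = {
    "ml-latest": {"ratings": 20_000_263, "users": 138_493, "movies": 27_278},
    "ml-1m": {"ratings": 1_000_209, "users": 6_040, "movies": 3_900},
    "ml-latest-small": {"ratings": 100_836, "users": 610, "movies": 9_742},
}


def density(ratings, users, movies):
    """Fraction of the user x movie matrix that holds an observed rating."""
    return ratings / (users * movies)


densities = {name: density(**counts) for name, counts in DATASETS.items()}

for name, d in densities.items():
    print(f"{name}: {d:.2%} of user-movie pairs are rated")
```

All three matrices come out well under 5% filled, which is why the approach relies on collaborative filtering and matrix factorization methods designed for sparse ratings.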

Approach

We will first build four models: a popularity-based baseline (most popular items), user-based collaborative filtering (UBCF), item-based collaborative filtering (IBCF) and singular value decomposition (SVD). We will then compare them to see which one gives the best results. The recommenderlab library will be used to build the recommender system. The recommendations need to be as accurate as possible, so the final model will be chosen by comparing the candidates on held-out ratings. Once the model is finalized, we will build a Shiny app that recommends movies the user may want to watch based on his/her selection of movies and genre.
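
The comparison step described above can be sketched on a toy example. This is not the planned recommenderlab workflow: it is a small, self-contained Python illustration, and the ratings, the held-out split and the function names (`item_mean`, `ubcf`, `rmse`) are all hypothetical. It scores a simple item-mean baseline and a user-based collaborative filter on the same held-out ratings and keeps whichever has the lower RMSE.

```python
import math

# Hypothetical toy ratings (user -> {movie: stars}); not taken from MovieLens.
train = {
    "u1": {"A": 5, "B": 4, "C": 1},
    "u2": {"A": 4, "B": 5, "D": 2},
    "u3": {"A": 1, "C": 5, "D": 4},
    "u4": {"B": 1, "C": 4, "D": 5},
}
# Held-out (user, movie, stars) triples, used only for scoring.
held_out = [("u1", "D", 1), ("u2", "C", 2), ("u3", "B", 2), ("u4", "A", 1)]


def item_mean(user, item):
    """Baseline: predict the movie's average training rating."""
    vals = [r[item] for r in train.values() if item in r]
    return sum(vals) / len(vals)


def cosine(a, b):
    """Cosine similarity computed over the movies both users rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)


def ubcf(user, item):
    """User-based CF: similarity-weighted average of other users' ratings."""
    pairs = [(cosine(train[user], r), r[item])
             for other, r in train.items() if other != user and item in r]
    total = sum(sim for sim, _ in pairs)
    if total == 0:
        return item_mean(user, item)  # no usable neighbours: fall back
    return sum(sim * rating for sim, rating in pairs) / total


def rmse(predict):
    """Root mean squared error of a predictor on the held-out ratings."""
    errors = [(predict(u, i) - r) ** 2 for u, i, r in held_out]
    return math.sqrt(sum(errors) / len(errors))


scores = {"item mean": rmse(item_mean), "UBCF": rmse(ubcf)}
best = min(scores, key=scores.get)
print(scores, "-> lowest RMSE:", best)
```

In the actual project the same idea would presumably apply to all four candidate models, with the split and error metrics produced by recommenderlab rather than by hand-written functions like these.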

Data 612 - Final Project Proposal

Group members: Habib Khan & Vijaya Cherukuri

June 30, 2020
