Introduction

In our final project, we are going to continue to use the MovieLens dataset. We have implemented many different Collaborative Filtering (CF) recommender systems such as user to user, item to item, and matrix factorization singular value decomposition (SVD), hybrid. With the final project, we will extract additional data and add implementation of Content Based recommender system. We will continue to create most popular collaborative filtering recommender systems and pick the one that performs the best, optimize with different values of a numeric parameter, and then compare the first user recommendations with the content based recommender generated recommendations for the first user.

As a brief overview of the Content Based (CB) recommender system, we will build for the final project; the CB recommender system will recommend items to customer “X”, similar items that is rated as by customer “X”. In order to do this, we are going to build the CB recommender system based on either genres, overview, keywords or combination of these features. The idea is to compute pairwise “cosine” similarity scores for all movies based on defined descriptions and recommend movies based on the similarity score threshold.

About the Data

The MovieLens dataset is publicly available (https://grouplens.org/datasets/movielens/latest/ or https://www.kaggle.com/rounakbanik/the-movies-dataset/data ) and changes overtime as grouplens makes it available. Small dataset is available for 100,000 ratings, 3,600 tag applied to 9,000 movies by 600 users. The full dataset has 27,000,000 ratings and 1,100,000 tag applications applied to 58,000 movies by 280,000 users. This includes tag genome data with 14 million relevance scores across 1,100 tags. As part of the goal of the assignment, we will choose a large dataset and attempt to use Spark and make sure the Recommender System we create works.

Process

In the final project, we will include all the Data Science Application Process requirements, from establishing a problem statement based on the goal of the project, data collection, data cleaning, data exploration, defining the recommender system approach, model development, evaluation and conclusion. The data cleaning and transformation phase will include necessary matrix creation and normalization, data exploration phase will include showing the distribution of variables (ratings), data preparation will include split the dataset to train and test datasets, model creation will include necessary model development and prediction and evaluation will include comparing models created with RMSE, MSE, MAE, ROC Curve and Precision Call.

Presentation

Our presentation will include the concepts we have developed and learned throughout the class and include the implementation of the recommender system we have build.