Final Project Proposal

Project Objectives

The goal for your final project is for you to build out a recommender system using a large dataset (e.g., 1M+ ratings, or 10k+ users and 10k+ items). There are three deliverables, with separate dates:

Planning Document: Find an interesting dataset and describe the system you plan to build out. If you would like to use one of the datasets you have already worked with, you should add a unique element or incorporate additional data (e.g., explicit features you scrape from another source, like image analysis on movie posters). The overall goal, however, will be to produce quality recommendations by extracting insights from a large dataset. You may do so using Spark or another distributed computing method, or by effectively applying one of the more advanced mathematical techniques we have covered. There is no preference for one over the other, as long as your recommender works! The planning document should be written up and published as a notebook on GitHub or in RPubs. Please submit the link in the Unit 4 folder, due Thursday, July 5.

Data

We gathered the data from the “recommended for education and development” section of https://grouplens.org/datasets/movielens/. That section provides two download links, and we chose the link for the full dataset. The dataset is described as follows:

This dataset (ml-latest) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 27,000,000 ratings and 1,100,000 tag applications across 58,000 movies. These data were created by 280,000 users between January 09, 1995 and September 26, 2018. This dataset was generated on September 26, 2018. There are 4 *.csv files, from which we chose two, movies.csv and ratings.csv, for our downstream analysis.

Citation

F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. https://doi.org/10.1145/2827872

Preview data

movies.csv

movieId  title                               genres
1        Toy Story (1995)                    Adventure|Animation|Children|Comedy|Fantasy
2        Jumanji (1995)                      Adventure|Children|Fantasy
3        Grumpier Old Men (1995)             Comedy|Romance
4        Waiting to Exhale (1995)            Comedy|Drama|Romance
5        Father of the Bride Part II (1995)  Comedy
6        Heat (1995)                         Action|Crime|Thriller
7        Sabrina (1995)                      Comedy|Romance
8        Tom and Huck (1995)                 Adventure|Children
9        Sudden Death (1995)                 Action
10       GoldenEye (1995)                    Action|Adventure|Thriller

ratings.csv

userId  movieId  rating  timestamp
1       307      3.5     1256677221
1       481      3.5     1256677456
1       1091     1.5     1256677471
1       1257     4.5     1256677460
1       1449     4.5     1256677264
1       1590     2.5     1256677236
1       1591     1.5     1256677475
1       2134     4.5     1256677464
1       2478     4.0     1256677239
1       2840     3.0     1256677500
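
A preview like this could be produced with pandas. The following is a minimal sketch, not the code used for this proposal: it assumes the real files would be read with `pd.read_csv`, and it inlines the first few preview rows so that it runs without the full download. The `rated_at` column is a hypothetical addition that converts the Unix timestamp (seconds since 1970-01-01 UTC) into a readable date.

```python
import io

import pandas as pd

# Inlined sample standing in for movies.csv and ratings.csv (assumption:
# with the real download these would be pd.read_csv("movies.csv") and
# pd.read_csv("ratings.csv")).
MOVIES_CSV = """movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
"""

RATINGS_CSV = """userId,movieId,rating,timestamp
1,307,3.5,1256677221
1,481,3.5,1256677456
1,1091,1.5,1256677471
"""

movies = pd.read_csv(io.StringIO(MOVIES_CSV))
ratings = pd.read_csv(io.StringIO(RATINGS_CSV))

# Hypothetical helper column: timestamps are Unix seconds.
ratings["rated_at"] = pd.to_datetime(ratings["timestamp"], unit="s")

# Show the first ten rows of each table (fewer here, since the sample is small).
print(movies.head(10))
print(ratings.head(10))
```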

Combine Data

Join movies with ratings on movieId

movieId  userId  rating  timestamp   title             genres
1        27273   4.0     1058580761  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        18292   4.0     851300490   Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        249441  3.5     1153607940  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        224714  4.5     1130781935  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        68923   3.0     831654940   Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        85589   4.0     834562996   Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        23649   5.0     1462998551  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        72840   4.0     1160787966  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        100646  5.0     939802716   Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        241031  4.0     854026053   Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
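
The join on movieId can be sketched with pandas as follows. This is an illustration under assumptions, not the exact code behind the table above: the join type (inner) is assumed, and the two small data frames are hand-built from a few of the rows shown in this proposal.

```python
import pandas as pd

# Small hand-built samples taken from the previews (stand-ins for the
# full movies.csv and ratings.csv).
movies = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story (1995)", "Jumanji (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
    ],
})
ratings = pd.DataFrame({
    "userId": [27273, 18292, 1],
    "movieId": [1, 1, 307],
    "rating": [4.0, 4.0, 3.5],
    "timestamp": [1058580761, 851300490, 1256677221],
})

# Assumption: an inner join, so ratings for movies missing from the
# movies table (movieId 307 in this sample) are dropped.
# validate="many_to_one" raises if movieId is not unique in movies.
combined = ratings.merge(movies, on="movieId", how="inner", validate="many_to_one")

# Reorder columns to match the layout of the combined table.
combined = combined[["movieId", "userId", "rating", "timestamp", "title", "genres"]]
print(combined)
```

Using `how="left"` instead would keep ratings whose movie has no metadata row, with missing values in title and genres.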

Deliverables:

A GitHub repository with all the data and code needed to understand and run the model. The final result will be published to RPubs.

Projected Workflow:

  • Data Preparation and Exploration
  • Pre-process/clean data
  • Reshape the dataset to user-item matrix
  • Generate summary statistics
  • Data Normalization
  • Data visualization
  • Split the data into training and test sets
  • Implement recommendation models and train the models
  • Predict
  • Evaluate accuracy/performance
  • Evaluate methods to improve model performance
  • Finalize Model
  • Generate movie recommendations
  • Time permitting, we will build a Shiny app and run the model on Spark
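
Several of the workflow steps above (train/test split, reshaping to a user-item matrix, normalization, prediction, and accuracy evaluation) can be sketched on a toy example. Everything here is an assumption made for illustration: the ratings are made up, normalization is taken to mean per-user mean-centering, the "model" is a trivial user-mean baseline rather than one of the recommenders we will actually train, and RMSE is only one possible accuracy metric.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings (not taken from MovieLens).
ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    "movieId": [1, 2, 3, 1, 2, 4, 1, 3, 4, 5],
    "rating":  [4.0, 3.0, 5.0, 2.0, 3.0, 4.0, 5.0, 4.0, 3.0, 4.0],
})

# Split: hold out 20% of the ratings at random (fixed seed for repeatability).
test = ratings.sample(frac=0.2, random_state=42)
train = ratings.drop(test.index)

# Reshape the training ratings into a user-item matrix (NaN = not rated).
user_item = train.pivot_table(index="userId", columns="movieId", values="rating")

# Normalize: subtract each user's mean rating from that user's row.
user_means = user_item.mean(axis=1)
centered = user_item.sub(user_means, axis=0)

# Baseline prediction: each user's mean training rating, falling back to
# the global mean for users with no training ratings.
global_mean = train["rating"].mean()
predicted = test["userId"].map(user_means).fillna(global_mean)

# Evaluate accuracy with root mean squared error on the held-out ratings.
rmse = float(np.sqrt(((test["rating"] - predicted) ** 2).mean()))
print(f"Baseline RMSE: {rmse:.3f}")
```

A dense pivot like this only works for small samples. With roughly 280,000 users and 58,000 movies, the full matrix would have billions of cells, so a sparse representation or a distributed framework such as Spark would be needed.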

Forhad Akbar

7/06/2020