Movie Recommender System with Large Dataset

Objectives

The goal for your final project is for you to build out a recommender system using a large dataset (e.g., 1M+ ratings or 10k+ users, 10k+ items). There are three deliverables, with separate dates:

[1] Planning Document. Find an interesting dataset and describe the system you plan to build out. If you would like to use one of the datasets you have already worked with, you should add a unique element or incorporate additional data (e.g., explicit features you scrape from another source, like image analysis on movie posters). The overall goal, however, will be to produce quality recommendations by extracting insights from a large dataset. You may do so using Spark, or another distributed computing method, OR by effectively applying one of the more advanced mathematical techniques we have covered. There is no preference for one over the other, as long as your recommender works! The planning document should be written up and published as a notebook on GitHub or in RPubs. Please submit the link in the Unit 4 folder, due Thursday, July 5.

[2] Presentation. Make a five-minute presentation of your system in our final meetup on Tuesday. If you’re not able to attend the meetup, you’re responsible for either recording your presentation, or scheduling one-on-one time to deliver your presentation prior to the meetup. You should be prepared to present on Tuesday. You should use this project to showcase some of the concepts that you have learned in this course, while delivering on the (probably) less familiar Spark platform. You are welcome to submit a compelling alternative proposal (subject to approval), such as implementing a recommender system in Microsoft Azure ML Studio or with Google TensorFlow, or building out an application of a certain complexity using another tool. You may work in a small group (2-3) on this assignment.

[3] Implementation. In this final project deliverable, you’ll build out the system that you describe in your planning document. This will be due on Thursday and must be turned in as an RMarkdown file or a Jupyter notebook, and posted to GitHub or RPubs.com.

Preamble

In our proposal, we said that we would use the full MovieLens dataset from the “recommended for education and development” section of https://grouplens.org/datasets/movielens/. However, the dataset was so large that R ran out of memory while creating the rating matrix. The error message is shown below.

Error: cannot allocate vector of size 113.7 Gb

So, we changed our plan and obtained similar but smaller data from Kaggle that still fulfills the minimum requirements of the project. The Kaggle data is described below.

Data

MyAnimeList, often abbreviated as MAL, is an anime and manga social networking and social cataloging application website. The site provides its users with a list-like system to organize and score anime and manga. It facilitates finding users who share similar tastes and provides a large database on anime and manga. In 2018, MyAnimeList reported having approximately 15,000 anime and 45,000 manga entries. In 2015, the site received 120 million visitors a month.

We gathered the data from Kaggle, which provides two CSV files. The data is described as follows:

This data set contains information on user preference data from 73,516 users on 12,294 anime. Each user is able to add anime to their completed list and give it a rating and this data set is a compilation of those ratings.

Anime.csv

  • anime_id - myanimelist.net’s unique id identifying an anime.
  • name - full name of anime.
  • genre - comma separated list of genres for this anime.
  • type - movie, TV, OVA, etc.
  • episodes - how many episodes in this show. (1 if movie).
  • rating - average rating out of 10 for this anime.
  • members - number of community members that are in this anime’s “group”.

Rating.csv

  • user_id - non identifiable randomly generated user id.
  • anime_id - the anime that this user has rated.
  • rating - rating out of 10 this user has assigned (-1 if the user watched it but didn’t assign a rating).

Load Data

Preview data

anime_id | name | genre | type | episodes | rating | members
32281 | Kimi no Na wa. | Drama, Romance, School, Supernatural | Movie | 1 | 9.37 | 200630
5114 | Fullmetal Alchemist: Brotherhood | Action, Adventure, Drama, Fantasy, Magic, Military, Shounen | TV | 64 | 9.26 | 793665
28977 | Gintama° | Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen | TV | 51 | 9.25 | 114262
9253 | Steins;Gate | Sci-Fi, Thriller | TV | 24 | 9.17 | 673572
9969 | Gintama' | Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen | TV | 51 | 9.16 | 151266
32935 | Haikyuu!!: Karasuno Koukou VS Shiratorizawa Gakuen Koukou | Comedy, Drama, School, Shounen, Sports | TV | 10 | 9.15 | 93351
11061 | Hunter x Hunter (2011) | Action, Adventure, Shounen, Super Power | TV | 148 | 9.13 | 425855
820 | Ginga Eiyuu Densetsu | Drama, Military, Sci-Fi, Space | OVA | 110 | 9.11 | 80679
15335 | Gintama Movie: Kanketsu-hen - Yorozuya yo Eien Nare | Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen | Movie | 1 | 9.10 | 72534
15417 | Gintama': Enchousen | Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen | TV | 13 | 9.11 | 81109
## Rows: 12,294
## Columns: 7
## $ anime_id <int> 32281, 5114, 28977, 9253, 9969, 32935, 11061, 820, 15335, ...
## $ name     <fct> Kimi no Na wa., Fullmetal Alchemist: Brotherhood, Gintama°...
## $ genre    <fct> "Drama, Romance, School, Supernatural", "Action, Adventure...
## $ type     <fct> Movie, TV, TV, TV, TV, TV, TV, OVA, Movie, TV, TV, Movie, ...
## $ episodes <fct> 1, 64, 51, 24, 51, 10, 148, 110, 1, 13, 24, 1, 201, 25, 25...
## $ rating   <dbl> 9.37, 9.26, 9.25, 9.17, 9.16, 9.15, 9.13, 9.11, 9.10, 9.11...
## $ members  <int> 200630, 793665, 114262, 673572, 151266, 93351, 425855, 806...
user_id anime_id rating
1 20 -1
1 24 -1
1 79 -1
1 226 -1
1 241 -1
1 355 -1
1 356 -1
1 442 -1
1 487 -1
1 846 -1
## Rows: 7,813,737
## Columns: 3
## $ user_id  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ anime_id <int> 20, 24, 79, 226, 241, 355, 356, 442, 487, 846, 936, 1546, ...
## $ rating   <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1...

Clean the Data

As mentioned above, a rating of -1 means the user watched the anime but didn’t assign a rating. We converted these unrated entries to NA and changed the data types for downstream analysis.
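The cleaning step can be illustrated with a minimal sketch in plain Python. This is not the R code behind the output in this report; the function name `clean_rows` and the string conversion of the ids (a rough stand-in for R factors) are assumptions made for illustration, and the third sample row is made up.

```python
from typing import List, Optional, Tuple

UNRATED = -1  # sentinel used in Rating.csv for "watched but not rated"

RawRow = Tuple[int, int, int]
CleanRow = Tuple[str, str, Optional[int]]


def clean_rows(rows: List[RawRow]) -> List[CleanRow]:
    """Replace the -1 sentinel with None and treat ids as categorical labels."""
    cleaned = []
    for user_id, anime_id, rating in rows:
        value = None if rating == UNRATED else rating
        cleaned.append((str(user_id), str(anime_id), value))
    return cleaned


if __name__ == "__main__":
    # First two rows come from the preview above; the last row is hypothetical.
    sample = [(1, 20, -1), (1, 24, -1), (3, 20, 8)]
    for row in clean_rows(sample):
        print(row)
```

Keeping the missing value distinct from a numeric score matters because a -1 left in place would be averaged in as if it were a very low rating.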

## Rows: 7,813,737
## Columns: 3
## $ user_id  <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ anime_id <fct> 20, 24, 79, 226, 241, 355, 356, 442, 487, 846, 936, 1546, ...
## $ rating   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## Rows: 12,294
## Columns: 7
## $ anime_id <fct> 32281, 5114, 28977, 9253, 9969, 32935, 11061, 820, 15335, ...
## $ name     <chr> "Kimi no Na wa.", "Fullmetal Alchemist: Brotherhood", "Gin...
## $ genre    <chr> "Drama, Romance, School, Supernatural", "Action, Adventure...
## $ type     <chr> "Movie", "TV", "TV", "TV", "TV", "TV", "TV", "OVA", "Movie...
## $ episodes <fct> 1, 64, 51, 24, 51, 10, 148, 110, 1, 13, 24, 1, 201, 25, 25...
## $ rating   <dbl> 9.37, 9.26, 9.25, 9.17, 9.16, 9.15, 9.13, 9.11, 9.10, 9.11...
## $ members  <int> 200630, 793665, 114262, 673572, 151266, 93351, 425855, 806...

Data Exploration

Highest rated anime (among the ten with the most members)

## Selecting by members
anime_id | name | genre | type | episodes | rating | members
5114 | Fullmetal Alchemist: Brotherhood | Action, Adventure, Drama, Fantasy, Magic, Military, Shounen | TV | 64 | 9.26 | 793665
9253 | Steins;Gate | Sci-Fi, Thriller | TV | 24 | 9.17 | 673572
1575 | Code Geass: Hangyaku no Lelouch | Action, Mecha, Military, School, Sci-Fi, Super Power | TV | 25 | 8.83 | 715151
1535 | Death Note | Mystery, Police, Psychological, Supernatural, Thriller | TV | 37 | 8.71 | 1013917
16498 | Shingeki no Kyojin | Action, Drama, Fantasy, Shounen, Super Power | TV | 25 | 8.54 | 896229
4224 | Toradora! | Comedy, Romance, School, Slice of Life | TV | 25 | 8.45 | 633817
6547 | Angel Beats! | Action, Comedy, Drama, School, Supernatural | TV | 13 | 8.39 | 717796
10620 | Mirai Nikki (TV) | Action, Mystery, Psychological, Shounen, Supernatural, Thriller | TV | 26 | 8.07 | 657190
11757 | Sword Art Online | Action, Adventure, Fantasy, Game, Romance | TV | 25 | 7.83 | 893100
20 | Naruto | Action, Comedy, Martial Arts, Shounen, Super Power | TV | 220 | 7.81 | 683297

Most watched type of show

Anime with the most members

## Selecting by members
anime_id | name | genre | type | episodes | rating | members
1535 | Death Note | Mystery, Police, Psychological, Supernatural, Thriller | TV | 37 | 8.71 | 1013917
16498 | Shingeki no Kyojin | Action, Drama, Fantasy, Shounen, Super Power | TV | 25 | 8.54 | 896229
11757 | Sword Art Online | Action, Adventure, Fantasy, Game, Romance | TV | 25 | 7.83 | 893100
5114 | Fullmetal Alchemist: Brotherhood | Action, Adventure, Drama, Fantasy, Magic, Military, Shounen | TV | 64 | 9.26 | 793665
6547 | Angel Beats! | Action, Comedy, Drama, School, Supernatural | TV | 13 | 8.39 | 717796
1575 | Code Geass: Hangyaku no Lelouch | Action, Mecha, Military, School, Sci-Fi, Super Power | TV | 25 | 8.83 | 715151
20 | Naruto | Action, Comedy, Martial Arts, Shounen, Super Power | TV | 220 | 7.81 | 683297
9253 | Steins;Gate | Sci-Fi, Thriller | TV | 24 | 9.17 | 673572
10620 | Mirai Nikki (TV) | Action, Mystery, Psychological, Shounen, Supernatural, Thriller | TV | 26 | 8.07 | 657190
4224 | Toradora! | Comedy, Romance, School, Slice of Life | TV | 25 | 8.45 | 633817

Create Matrix

## 73515 x 11200 rating matrix of class 'realRatingMatrix' with 7813730 ratings.
## [1] 73515 11200
## 99233736 bytes
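To put the sizes printed above in context, the following sketch works through the arithmetic for a dense version of the same matrix. The dimensions, rating count, and sparse object size are taken from the output above; the 8 bytes per cell is an assumption (double-precision storage), and the function names are illustrative only.

```python
# Figures reported in the matrix summary above.
N_USERS = 73_515
N_ITEMS = 11_200
N_RATINGS = 7_813_730
SPARSE_BYTES = 99_233_736

BYTES_PER_CELL = 8  # assumption: a dense matrix of 8-byte doubles


def dense_bytes(n_rows: int, n_cols: int, bytes_per_cell: int = BYTES_PER_CELL) -> int:
    """Memory needed if every user-item cell were stored explicitly."""
    return n_rows * n_cols * bytes_per_cell


def density(n_ratings: int, n_rows: int, n_cols: int) -> float:
    """Fraction of user-item cells that actually hold a rating."""
    return n_ratings / (n_rows * n_cols)


if __name__ == "__main__":
    dense = dense_bytes(N_USERS, N_ITEMS)
    print(f"dense size : {dense / 2**30:.2f} GiB")
    print(f"sparse size: {SPARSE_BYTES / 2**20:.1f} MiB")
    print(f"density    : {density(N_RATINGS, N_USERS, N_ITEMS):.4%}")
```

Fewer than 1% of the cells hold a rating, which is why a sparse representation stays under 100 MB while a dense one would need several gigabytes.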

Selecting the most relevant data

On exploring the data, we noticed that the table contains:

  • Ratings of anime that have been watched only a few times, which might therefore be biased. So, we’ll keep anime that have been watched at least 1000 times.
  • Ratings from users who rated only a few anime, which might be biased too. So, we’ll keep users who have rated at least 500 anime shows.
## 1843 x 1720 rating matrix of class 'realRatingMatrix' with 967727 ratings.
## 11850056 bytes
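The filtering described above can be sketched as follows, in plain Python with a toy data set. This illustrates the idea and is not the code that produced the 1843 x 1720 matrix. The function name `filter_ratings` is hypothetical, and the sketch assumes that both counts are computed once on the unfiltered data.

```python
from collections import Counter
from typing import List, Tuple

Rating = Tuple[str, str, int]  # (user_id, anime_id, rating)


def filter_ratings(ratings: List[Rating], min_item_count: int,
                   min_user_count: int) -> List[Rating]:
    """Keep ratings whose anime and user both meet a minimum rating count.

    Counts are taken on the unfiltered data, so a kept user can end up
    with fewer than min_user_count ratings after rare anime are dropped.
    """
    item_counts = Counter(item for _, item, _ in ratings)
    user_counts = Counter(user for user, _, _ in ratings)
    return [
        (user, item, score)
        for user, item, score in ratings
        if item_counts[item] >= min_item_count
        and user_counts[user] >= min_user_count
    ]


if __name__ == "__main__":
    # Toy data; the report uses thresholds of 1000 (anime) and 500 (users).
    toy = [("u1", "a", 8), ("u1", "b", 7), ("u2", "a", 9),
           ("u3", "c", 5), ("u2", "b", 6)]
    print(filter_ratings(toy, min_item_count=2, min_user_count=2))
```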

Data Visualization

Recommendation algorithms

Split the dataset into training set (80%) and testing set (20%):

## Evaluation scheme with 4 items given
## Method: 'split' with 4 run(s).
## Training set proportion: 0.800
## Good ratings: >=5.000000
## Data set: 1843 x 1720 rating matrix of class 'realRatingMatrix' with 967727 ratings.
## Normalized using center on rows.
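The evaluation scheme printed above holds out 20% of the users and, for each held-out user, reveals only 4 ratings to the model. A rough Python sketch of that idea follows. The function names `split_users` and `withhold` are hypothetical, and the sampling and rounding details of the package used in this report are not reproduced.

```python
import random
from typing import Dict, List, Sequence, Tuple


def split_users(user_ids: Sequence[int], train_fraction: float = 0.8,
                seed: int = 0) -> Tuple[List[int], List[int]]:
    """Randomly assign users to a training set and a test set."""
    rng = random.Random(seed)
    shuffled = list(user_ids)
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def withhold(ratings: Dict[str, float], given: int = 4,
             seed: int = 0) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Split one test user's ratings into `given` known items and the rest."""
    rng = random.Random(seed)
    items = sorted(ratings)
    rng.shuffle(items)
    known = {item: ratings[item] for item in items[:given]}
    unknown = {item: ratings[item] for item in items[given:]}
    return known, unknown


if __name__ == "__main__":
    train, test = split_users(range(1843))
    print(len(train), len(test))
    # A made-up test user with six ratings.
    user = {"a": 8, "b": 7, "c": 9, "d": 5, "e": 6, "f": 10}
    known, unknown = withhold(user)
    print(sorted(known), sorted(unknown))
```

With 1843 users and an 80% training share, this gives 1474 training users, which agrees with the "learned using 1474 users" lines in the model summaries in this report.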

Item-Item Collaborative Filtering

Item-item collaborative filtering computes the similarity between items from users’ ratings of those items, and then recommends items similar to the ones a user has already rated. The similarities between items are computed with one of the similarity measures, and the similarity values are then used to predict ratings for user-item pairs absent from the data.
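A minimal sketch of the two steps described above, in plain Python on toy data. It assumes cosine similarity over co-rating users and a similarity-weighted average for prediction; the measure and neighbourhood settings actually used for the results in this report are not shown here, and all names and ratings below are made up.

```python
from math import sqrt
from typing import Dict, Optional

ItemVector = Dict[str, float]  # user_id -> rating for one item


def cosine(a: ItemVector, b: ItemVector) -> float:
    """Cosine similarity of two items over the users who rated both."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[u] * b[u] for u in common)
    norm_a = sqrt(sum(a[u] ** 2 for u in common))
    norm_b = sqrt(sum(b[u] ** 2 for u in common))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict(user_ratings: Dict[str, float], target: str,
            items: Dict[str, ItemVector]) -> Optional[float]:
    """Similarity-weighted average of the user's ratings of similar items."""
    numerator = denominator = 0.0
    for item, rating in user_ratings.items():
        if item == target:
            continue
        sim = cosine(items[target], items[item])
        if sim > 0:
            numerator += sim * rating
            denominator += sim
    # No similar rated item means no prediction can be made.
    return numerator / denominator if denominator else None


if __name__ == "__main__":
    items = {
        "A": {"u1": 8, "u2": 6, "u3": 9},
        "B": {"u1": 8, "u2": 6, "u4": 7},
        "C": {"u3": 2, "u4": 9},
        "D": {"u9": 5},
    }
    u4 = {"B": 7, "C": 9}
    print(predict(u4, "A", items))
    print(predict(u4, "D", items))
```

The second call returns `None` because item "D" shares no raters with anything the user has rated, which is the same kind of gap that can leave an item-based model with nothing to recommend.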

Training model

In the step below, we train the model.

## Warning in .local(x, ...): x was already normalized by row!
## Recommender of type 'IBCF' for 'realRatingMatrix' 
## learned using 1474 users.

Predict

## $`226`
##  [1]  173  436  507  993 1059 1107 1390 1453 1525 1569
## 
## $`392`
##  [1]  540  702  868  930  972  991 1193 1341  286  317
## 
## $`478`
##  [1]   65  638  826 1090 1371 1399 1449 1476 1536 1544
## 
## $`804`
## integer(0)
## 
## $`2632`
##  [1]  643  870 1360   43   50  107  114  163  180  222
## 
## $`3009`
##  [1] 477   1  16  18  24 135 171 181 195 217
## 
## $`3117`
##  [1]  15  22 130 132 139 158 213 358 455 464
## 
## $`3338`
##  [1]  68  74 155 262 415 721 724 818 822 844

Due to a lack of historical data, the IBCF model may sometimes recommend no items for one or more users, as for user 804 above.

## [[1]]
##                                          name  type
## 1                           Appleseed (Movie) Movie
## 2                                     Avenger    TV
## 3        Mobile Suit Gundam: The 08th MS Team   OVA
## 4  Mobile Suit Gundam 0080: War in the Pocket   OVA
## 5                                      X/1999 Movie
## 6                                           X    TV
## 7               Sen to Chihiro no Kamikakushi Movie
## 8                                    Planetes    TV
## 9               InuYasha: Guren no Houraijima Movie
## 10        InuYasha: Kagami no Naka no Mugenjo Movie
## 
## [[2]]
##                                    name  type
## 1               Rozen Maiden: Träumend    TV
## 2                  Mobile Suit Gundam I Movie
## 3                            Hi no Tori    TV
## 4                      Macross 7 Encore   OVA
## 5 City Hunter: Hyakuman Dollar no Inbou   OVA
## 6                          Busou Renkin    TV
## 7   Super Robot Taisen OG The Animation   OVA
## 
## [[3]]
##                                 name  type
## 1  Tenchi Muyou! Ryououki 2nd Season   OVA
## 2                      Slayers Great Movie
## 3              Densetsu Kyojin Ideon    TV
## 4                 Bible Black Gaiden   OVA
## 5                Usagi-chan de Cue!!   OVA
## 6          Happy Seven: The TV Manga    TV
## 7   Violence Jack: Harlem Bomber-hen   OVA
## 8                              B'T X    TV
## 9 Final Fantasy VII: Advent Children Movie
## 
## [[4]]
##                                   name type
## 1                               Naruto   TV
## 2           Kidou Tenshi Angelic Layer   TV
## 3                          Arc the Lad   TV
## 4                              Chobits   TV
## 5          Basilisk: Kouga Ninpou Chou   TV
## 6      Mahou Shoujo Lyrical Nanoha A's   TV
## 7                             Shuffle!   TV
## 8                           Boys Be...   TV
## 9                      Chuuka Ichiban!   TV

Singular Value Decomposition

Please refer to the RPubs link for our detailed explanation of SVD, which we provided in Project 3.

Training

## Warning in .local(x, ...): x was already normalized by row!
## Recommender of type 'SVD' for 'realRatingMatrix' 
## learned using 1474 users.

Predict

## $`226`
##  [1] 1569 1684 1632 1618 1665 1699 1635 1551 1679 1655
## 
## $`392`
##  [1] 1158    9 1505  505  159 1152 1429  133  458 1386
## 
## $`478`
##  [1] 1635 1569 1665 1618 1632 1699 1676 1589 1679 1600
## 
## $`804`
##  [1] 516 429 521 605 537 665 705 644  38 529
## 
## $`2632`
##  [1] 172 285 407 214 254 321  99 217 567 288
## 
## $`3009`
##  [1]  16  18 135 583   1 671 235  21  24 333
## 
## $`3117`
##  [1] 494 587 467   9 159 219 737 761 545 869
## 
## $`3338`
##  [1]  575  669  708 1158 1635  670  495  768  709 1671

Unlike IBCF, the SVD model returned recommendations for every user shown here, including user 804. SVD is a common and reliable way to estimate the missing entries of a data matrix.

## [[1]]
##                                      name    type
## 1 Ghost in the Shell: Stand Alone Complex      TV
## 2                     eX-Driver the Movie   Movie
## 3                                 Mizuiro     OVA
## 4                                Shuffle!      TV
## 5              Girls Bravo: Second Season      TV
## 6                           Buttobi!! CPU     OVA
## 7               Maze☆Bakunetsu Jikuu (TV)      TV
## 8     Pokemon: Senritsu no Mirage Pokemon Special
## 
## [[2]]
##                                            name type
## 1                Ai Shimai 2: Futari no Kajitsu  OVA
## 2                   Otome wa Boku ni Koishiteru   TV
## 3                            Babel Nisei (1992)  OVA
## 4  Soreyuke! Uchuu Senkan Yamamoto Yohko (1999)   TV
## 5                              Shintaisou: Shin  OVA
## 6                                Romeo x Juliet   TV
## 7                            Cosmo Warrior Zero   TV
## 8                                     Bartender   TV
## 9                          Green Green Specials  OVA
## 10                            Galaxy Angel Rune   TV
## 
## [[3]]
##                                                  name    type
## 1               Geobreeders: File-X Chibi Neko Dakkan     OVA
## 2 Detective Conan Movie 09: Strategy Above the Depths   Movie
## 3                          Fushigiboshi no☆Futagohime      TV
## 4                                Boukyaku no Senritsu      TV
## 5                            New Dominion Tank Police     OVA
## 6                         Lupin III: Nusumareta Lupin Special
## 7                                         Green Green      TV
## 8                                       Buttobi!! CPU     OVA
## 9                                        Blood Royale     OVA
## 
## [[4]]
##                                        name  type
## 1                         Kage kara Mamoru!    TV
## 2               Yu☆Gi☆Oh!: Duel Monsters GX    TV
## 3                        Gokinjo Monogatari    TV
## 4                          Tokyo Godfathers Movie
## 5      Grappler Baki: Saidai Tournament-hen    TV
## 6 Bishoujo Senshi Sailor Moon: Sailor Stars    TV
## 7                           Weiß Kreuz OVA   OVA
## 8                              Zetsuai 1989   OVA
## 9                                    Blame!   ONA

Hybrid Recommender

To incorporate serendipity, novelty, or diversity, we created a hybrid model with the following weights:

50% for IBCF
30% for POPULAR
10% for RERECOMMEND
10% for RANDOM
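One way to picture the weighting above is a weighted sum of the scores each component model assigns to an item. The sketch below uses the four weights listed above; everything else (the function names, the toy scores, and the choice to skip items a model does not score) is an assumption for illustration and may differ from how the hybrid in this report combines its components.

```python
from typing import Dict, List

# Weights as listed above.
WEIGHTS = {"IBCF": 0.5, "POPULAR": 0.3, "RERECOMMEND": 0.1, "RANDOM": 0.1}


def hybrid_scores(model_scores: Dict[str, Dict[str, float]],
                  weights: Dict[str, float]) -> Dict[str, float]:
    """Weighted sum of per-model item scores; unscored items contribute 0."""
    combined: Dict[str, float] = {}
    for model, scores in model_scores.items():
        weight = weights[model]
        for item, score in scores.items():
            combined[item] = combined.get(item, 0.0) + weight * score
    return combined


def top_n(scores: Dict[str, float], n: int) -> List[str]:
    """Items with the highest combined score, ties broken by item id."""
    return sorted(scores, key=lambda item: (-scores[item], item))[:n]


if __name__ == "__main__":
    # Made-up scores for three items from the four component models.
    toy = {
        "IBCF": {"x": 9.0, "y": 5.0},
        "POPULAR": {"x": 6.0, "y": 8.0, "z": 7.0},
        "RERECOMMEND": {"y": 10.0},
        "RANDOM": {"z": 3.0},
    }
    combined = hybrid_scores(toy, WEIGHTS)
    print(combined)
    print(top_n(combined, 2))
```

Giving a small weight to the random and popularity components is what lets items outside a user's usual neighbourhood reach the list, at the cost of some accuracy.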

Predict

## $`226`
##  [1] 1038   45  671  996  728  590  226    8 1633  139
## 
## $`392`
##  [1] 1038 1024  996  705  761 1633 1041 1125  529   70
## 
## $`478`
##  [1]  482 1038  858  283  226  154 1633  529  671 1287
## 
## $`804`
##  [1]  333 1708  482  378  558  698 1125 1038 1492  600
## 
## $`2632`
##  [1]  761 1633 1038  698    8 1708  245 1185 1717  378
## 
## $`3009`
##  [1]  616 1229  810   18  724  271  691  477 1347  154
## 
## $`3117`
##  [1] 1038  996  332 1185 1671  590   15  467  558 1161
## 
## $`3338`
##  [1] 1287 1186 1125 1185 1655  627  567  425  285  154

Some of the items recommended by IBCF and SVD also appear in the hybrid recommender’s lists.

Let’s see the actual items recommended.

## [[1]]
##                                        name    type
## 1 Naruto: Akaki Yotsuba no Clover wo Sagase Special
## 2 Bishoujo Senshi Sailor Moon: Sailor Stars      TV
## 3                       eX-Driver the Movie   Movie
## 4               Yu☆Gi☆Oh!: Duel Monsters GX      TV
## 5   Haru no Ashioto The Movie: Ourin Dakkan   Movie
## 6                          Shintaisou: Kari     OVA
## 7  Soreyuke! Uchuu Senkan Yamamoto Yohko II     OVA
## 8                   Ojamajo Doremi Dokkaan!      TV
## 
## [[2]]
##                               name  type
## 1      Yu☆Gi☆Oh!: Duel Monsters GX    TV
## 2                          Mizuiro   OVA
## 3              Gunparade Orchestra    TV
## 4                    Akage no Anne    TV
## 5                       Elfen Lied    TV
## 6                      Shaman King    TV
## 7                 Shintaisou: Kari   OVA
## 8             Saishuu Heiki Kanojo    TV
## 9              Lemon Angel Project    TV
## 10   Odin: Koushi Hansen Starlight Movie
## 
## [[3]]
##                                                         name    type
## 1                                                    Mizuiro     OVA
## 2                                            Weiß Kreuz OVA     OVA
## 3                  Bishoujo Senshi Sailor Moon: Sailor Stars      TV
## 4 Chou Henshin Cosprayers vs. Ankoku Uchuu Shougun the Movie   Movie
## 5                  Naruto: Akaki Yotsuba no Clover wo Sagase Special
## 6                                           Shintaisou: Kari     OVA
## 7                                               Zetsuai 1989     OVA
## 8                                       Saishuu Heiki Kanojo      TV
## 
## [[4]]
##                                        name    type
## 1                          Shintaisou: Kari     OVA
## 2                                   Mizuiro     OVA
## 3              Battle Athletess Daiundoukai     OVA
## 4             Mobile Police Patlabor: WXIII   Movie
## 5 Bishoujo Senshi Sailor Moon: Sailor Stars      TV
## 6 Naruto: Akaki Yotsuba no Clover wo Sagase Special
## 7                    Android Ana Maico 2010      TV
## 8                        Petshop of Horrors      TV

Calculating and comparing accuracies

IBCF

SVD

Hybrid

TopN

TopN Accuracy

Recommender      TP         FP        FN        TN        precision  recall     TPR        FPR
anime_item_acc1  0.2710027  8.065041  109.1463  1598.518  0.0382322  0.0055655  0.0055655  0.0048232
anime_svd_acc1   1.3929539  8.607046  108.0244  1597.976  0.1392954  0.0066003  0.0066003  0.0052008
anime_hy_acc1    0.9186992  9.081301  108.4986  1597.501  0.0918699  0.0082115  0.0082115  0.0055893
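For reference, the ratio columns in the TopN table follow the usual confusion-matrix definitions, sketched below in Python. The function names are illustrative. The precision of the SVD and hybrid rows can be reproduced from their TP and FP columns; the other ratios cannot be recomputed exactly from the averaged counts, most likely because they are averaged per user rather than derived from the averaged counts.

```python
def precision(tp: float, fp: float) -> float:
    """Share of recommended items that were relevant."""
    total = tp + fp
    return tp / total if total else 0.0


def recall(tp: float, fn: float) -> float:
    """Share of relevant items that were recommended (also the TPR)."""
    total = tp + fn
    return tp / total if total else 0.0


def false_positive_rate(fp: float, tn: float) -> float:
    """Share of irrelevant items that were recommended anyway."""
    total = fp + tn
    return fp / total if total else 0.0


if __name__ == "__main__":
    # TP and FP for the SVD and hybrid rows of the table above.
    print(f"SVD precision   : {precision(1.3929539, 8.607046):.7f}")
    print(f"Hybrid precision: {precision(0.9186992, 9.081301):.7f}")
```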

Ratings

Ratings Accuracy

Recommender      RMSE      MSE       MAE
anime_item_acc2  1.546860  2.392774  1.094808
anime_svd_acc2   1.673719  2.801336  1.224757
anime_hy_acc2    1.595474  2.545536  1.160603
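The three rating-error measures are related as sketched below (RMSE is the square root of MSE). The function name and the toy predictions are made up for illustration; only the check against the IBCF row uses numbers from the table.

```python
from math import sqrt
from typing import Dict, Sequence


def rating_errors(predicted: Sequence[float],
                  actual: Sequence[float]) -> Dict[str, float]:
    """RMSE, MSE and MAE between predicted and actual ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    diffs = [p - a for p, a in zip(predicted, actual)]
    mse = sum(d * d for d in diffs) / len(diffs)
    mae = sum(abs(d) for d in diffs) / len(diffs)
    return {"RMSE": sqrt(mse), "MSE": mse, "MAE": mae}


if __name__ == "__main__":
    # Hypothetical predictions and held-out ratings on the 1-10 scale.
    print(rating_errors([7.0, 8.0, 5.0, 9.0], [8.0, 8.0, 3.0, 10.0]))
    # Consistency check on the IBCF row: RMSE squared should be close to MSE.
    print(1.546860 ** 2)
```

Because squaring penalises large misses more heavily, RMSE is always at least as large as MAE, as in every row of the table.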

ROC Curve

## IBCF run fold/sample [model time/prediction time]
##   1
## Warning in .local(x, ...): x was already normalized by row!
## [25.6sec/0.08sec] 
##   2
## Warning in .local(x, ...): x was already normalized by row!
## [24.32sec/0.08sec] 
##   3
## Warning in .local(x, ...): x was already normalized by row!
## [24.78sec/0.06sec] 
##   4
## Warning in .local(x, ...): x was already normalized by row!
## [24.08sec/0.08sec] 
## SVD run fold/sample [model time/prediction time]
##   1
## Warning in .local(x, ...): x was already normalized by row!
## [0.66sec/0.27sec] 
##   2
## Warning in .local(x, ...): x was already normalized by row!
## [0.63sec/0.28sec] 
##   3
## Warning in .local(x, ...): x was already normalized by row!
## [0.85sec/0.28sec] 
##   4
## Warning in .local(x, ...): x was already normalized by row!
## [0.64sec/0.28sec] 
## POPULAR run fold/sample [model time/prediction time]
##   1
## Warning in .local(x, ...): x was already normalized by row!
## [0.02sec/1.3sec] 
##   2
## Warning in .local(x, ...): x was already normalized by row!
## [0.01sec/1.31sec] 
##   3
## Warning in .local(x, ...): x was already normalized by row!
## [0.01sec/1.31sec] 
##   4
## Warning in .local(x, ...): x was already normalized by row!
## [0.02sec/1.31sec] 
## RANDOM run fold/sample [model time/prediction time]
##   1  [0sec/0.32sec] 
##   2  [0sec/0.32sec] 
##   3  [0sec/0.33sec] 
##   4  [0.02sec/0.35sec]

Runtime comparison

Run Time
IBCF Model Training: 25.76 sec elapsed
IBCF Model Predicting: top 10: 0.11 sec elapsed
IBCF Model Predicting: ratings: 0.06 sec elapsed
SVD Model Training: 0.39 sec elapsed
SVD Model predicting: top 10: 0.32 sec elapsed
SVD Model predicting: ratings: 0.21 sec elapsed
Hybrid Recommender Training: 26.37 sec elapsed
Hybrid Recommender Predicting: top 10: 6.55 sec elapsed
Hybrid Recommender Predicting: ratings: 6.38 sec elapsed

Summary

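Low error together with low runtime indicates good performance. Based on our accuracy and runtime tables, no single model is best on every measure: IBCF has the lowest rating error (RMSE 1.55), SVD has the highest top-N precision (0.139) and is by far the fastest to train (0.39 sec), and the hybrid has the highest recall (0.0082) and an RMSE between the other two (1.60), but is the slowest to train and to predict. We still consider the hybrid the best overall choice, because its component models make up for each other’s shortcomings; for example, it produced recommendations for user 804, for whom IBCF returned none. We also noted that there are problems recommending items to new users because of a lack of historical data.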

Take home from this course

In our day-to-day life, we had used recommenders such as those on Amazon.com and Netflix.com, but didn’t know about the underlying algorithms. This course offered an opportunity to learn them through a variety of exercises.