1. Introduction

MyAnimeList, also known as MAL, is an anime and manga social networking website with a database in which users can organize anime into personal lists. Users can rate the anime on their lists after watching them, which makes it possible to find users with similar tastes. This project explores the MAL dataset to gain insights and then builds an item-item collaborative filtering recommender system to recommend anime to users and predict their ratings. The recommender system is then evaluated to see how well it performs when recommending items.

The data was obtained from Kaggle.com and contains information on 73,516 users and the ratings they may have given to 12,294 anime. Ratings range from 1 to 10, with 10 being the best. A rating of -1 means that the user did not provide a rating for that item.

2. Objective / Motivation

The goal of this project is to predict a user’s taste, specifically what a user will want to watch or buy in the future, and to make recommendations accordingly. Such predictions need large amounts of user data in order to find patterns and associate prior tastes with future choices. It is often difficult to provide good recommendations when information about users is limited. Explicit ratings are the most useful signal, but users provide far fewer of them than we would like, so the user-item data is sparse. To produce meaningful recommendations despite this sparsity, I propose three techniques: (1) item-item collaborative filtering, (2) Singular Value Decomposition (SVD), and (3) a hybrid recommender system. The system will be implemented in R using training and test sets split 80%:20%. The error of each model will be reported as root mean square error (RMSE) as a measure of performance.

3. My Anime List Recommender System

3.1 Data Pre-processing

3.1.1 Import files

## Observations: 7,813,737
## Variables: 3
## $ user_id  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ anime_id <int> 20, 24, 79, 226, 241, 355, 356, 442, 487, 846, 936, 1...
## $ rating   <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -...
## Observations: 12,294
## Variables: 7
## $ anime_id <int> 32281, 5114, 28977, 9253, 9969, 32935, 11061, 820, 15...
## $ name     <fct> Kimi no Na wa., Fullmetal Alchemist: Brotherhood, Gin...
## $ genre    <fct> "Drama, Romance, School, Supernatural", "Action, Adve...
## $ type     <fct> Movie, TV, TV, TV, TV, TV, TV, OVA, Movie, TV, TV, Mo...
## $ episodes <fct> 1, 64, 51, 24, 51, 10, 148, 110, 1, 13, 24, 1, 201, 2...
## $ rating   <dbl> 9.37, 9.26, 9.25, 9.17, 9.16, 9.15, 9.13, 9.11, 9.10,...
## $ members  <int> 200630, 793665, 114262, 673572, 151266, 93351, 425855...

3.1.2 Clean Data

According to the description that accompanies the data, ratings range from 1 to 10, and an item that a user did not rate has a rating of -1. For simplicity, I will change -1 to NA to indicate that the rating is missing. I will also change the data type of some variables.
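The recoding step can be sketched as follows. The report itself was built in R and its code is not shown, so this is only an illustrative Python sketch on toy rows; the names `MISSING` and `recode_missing` are hypothetical, and `None` stands in for R's NA.

```python
# Sentinel the raw data uses when a user did not provide a rating.
MISSING = -1

def recode_missing(rows):
    """Return a copy of (user_id, anime_id, rating) rows with -1 replaced by None."""
    return [(u, a, None if r == MISSING else r) for u, a, r in rows]

# Toy rows: the first two mirror the start of the ratings file, the third is made up.
raw = [(1, 20, -1), (1, 24, -1), (3, 20, 8)]
clean = recode_missing(raw)

# Only explicit ratings should enter averages or similarity computations.
rated = [row for row in clean if row[2] is not None]
print(clean)
print(len(rated), "explicit rating(s)")
```

Keeping the missing marker distinct from any numeric value matters because a literal -1 would otherwise drag down every mean and distort every similarity.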

3.2 Exploratory Data Analysis

Before we create a matrix to build the recommenders, let’s gather some insights from the data.

3.2.1 Highest rated items

## Selecting by members
anime_id  name                              genre                                                        type  episodes  rating  members
5114      Fullmetal Alchemist: Brotherhood  Action, Adventure, Drama, Fantasy, Magic, Military, Shounen  TV    64        9.26    793665
1575      Code Geass: Hangyaku no Lelouch   Action, Mecha, Military, School, Sci-Fi, Super Power         TV    25        8.83    715151
1535      Death Note                        Mystery, Police, Psychological, Supernatural, Thriller       TV    37        8.71    1013917
16498     Shingeki no Kyojin                Action, Drama, Fantasy, Shounen, Super Power                 TV    25        8.54    896229
6547      Angel Beats!                      Action, Comedy, Drama, School, Supernatural                  TV    13        8.39    717796
11757     Sword Art Online                  Action, Adventure, Fantasy, Game, Romance                    TV    25        7.83    893100
20        Naruto                            Action, Comedy, Martial Arts, Shounen, Super Power           TV    220       7.81    683297

3.2.2 Most watched type of show


The type of about 25 anime is unknown. Most anime fall under the TV category.

Note


ONA - Original Net Animation: an anime that is released directly onto the Internet.
OVA - Original Video Animation: an animated film or series made specifically for release in home-video formats.

3.2.3 Anime with the most members

## Selecting by members
anime_id  name                              genre                                                            type  episodes  rating  members
1535      Death Note                        Mystery, Police, Psychological, Supernatural, Thriller           TV    37        8.71    1013917
16498     Shingeki no Kyojin                Action, Drama, Fantasy, Shounen, Super Power                     TV    25        8.54    896229
11757     Sword Art Online                  Action, Adventure, Fantasy, Game, Romance                        TV    25        7.83    893100
5114      Fullmetal Alchemist: Brotherhood  Action, Adventure, Drama, Fantasy, Magic, Military, Shounen      TV    64        9.26    793665
6547      Angel Beats!                      Action, Comedy, Drama, School, Supernatural                      TV    13        8.39    717796
1575      Code Geass: Hangyaku no Lelouch   Action, Mecha, Military, School, Sci-Fi, Super Power             TV    25        8.83    715151
20        Naruto                            Action, Comedy, Martial Arts, Shounen, Super Power               TV    220       7.81    683297
9253      Steins;Gate                       Sci-Fi, Thriller                                                 TV    24        9.17    673572
10620     Mirai Nikki (TV)                  Action, Mystery, Psychological, Shounen, Supernatural, Thriller  TV    26        8.07    657190
4224      Toradora!                         Comedy, Romance, School, Slice of Life                           TV    25        8.45    633817

Let’s move on to creating a user-item matrix.

3.3 User-Item Matrix

## 73515 x 11200 rating matrix of class 'realRatingMatrix' with 7813730 ratings.

Most entries of this matrix are missing, yet it still uses a lot of memory: its size is about 99 MB.

## 99233736 bytes

I will reduce the matrix so that it only contains users who rated at least 500 anime and anime that were rated at least 1,000 times.
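A minimal sketch of this kind of filtering is shown below, in Python for illustration only (the report's R code is not shown). The function name `filter_ratings` and the toy thresholds of 2 are assumptions standing in for the 500 and 1,000 used above, and the sketch counts ratings once on the full data, which may or may not match the exact procedure used in the report.

```python
from collections import Counter

def filter_ratings(ratings, min_per_user, min_per_item):
    """Keep (user, item, rating) triples whose user and item both meet the minimums.

    Counts are taken once on the full data, so a kept user can end up with
    fewer than min_per_user ratings after sparsely rated items are dropped.
    """
    per_user = Counter(u for u, _, _ in ratings)
    per_item = Counter(i for _, i, _ in ratings)
    return [(u, i, r) for u, i, r in ratings
            if per_user[u] >= min_per_user and per_item[i] >= min_per_item]

# Toy data: user "c" and item "z" each have a single rating.
ratings = [("a", "x", 8), ("a", "y", 7), ("b", "x", 9), ("b", "y", 6), ("c", "z", 5)]
kept = filter_ratings(ratings, min_per_user=2, min_per_item=2)
print(kept)
```

Dropping rarely rated items and rarely rating users trades coverage for density: the remaining matrix is smaller and has far fewer empty cells per row.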

## 1843 x 1720 rating matrix of class 'realRatingMatrix' with 967727 ratings.
## 11850056 bytes

3.4 Similarity

Similarity among the first 50 users

Similarity among the first 50 anime items


Based on the similarity plots, items have more in common than users do with each other.
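The report does not state which similarity measure the plots use. Assuming cosine similarity, a common choice for rating data, the value for one pair of users or items could be computed as in this illustrative Python sketch (the function name `cosine` and the toy vectors are hypothetical).

```python
import math

def cosine(a, b):
    """Cosine similarity over positions where both vectors hold a rating (not None)."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return None  # no co-rated entries, similarity is undefined
    dot = sum(x * y for x, y in pairs)
    norm_a = math.sqrt(sum(x * x for x, _ in pairs))
    norm_b = math.sqrt(sum(y * y for _, y in pairs))
    if norm_a == 0 or norm_b == 0:
        return None
    return dot / (norm_a * norm_b)

# Two rating vectors that agree in direction on their co-rated positions.
same_direction = cosine([1, 2, None], [2, 4, 5])
# Two vectors with no co-rated positions.
no_overlap = cosine([1, None], [None, 2])
print(same_direction, no_overlap)
```

For item similarity the vectors are columns of the user-item matrix (one entry per user); for user similarity they are rows (one entry per item).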

3.5 Building Recommender Systems

Recommender systems aim to predict users’ interests and recommend items that are likely to interest them. They help users make decisions by surfacing new and relevant items. As mentioned earlier, we will look at how three types of recommenders work.

First, we divide the data into training and test sets so that the recommender algorithms can learn from the training data and then try to predict relevant outcomes for the test data.

Training and Test sets

## Evaluation scheme with 4 items given
## Method: 'split' with 4 run(s).
## Training set proportion: 0.800
## Good ratings: >=5.000000
## Data set: 1843 x 1720 rating matrix of class 'realRatingMatrix' with 967727 ratings.
## Normalized using center on rows.
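The scheme printed above (80% of users for training, 4 items given per test user) can be pictured with the following Python sketch. It is an illustration under assumptions, not the report's R code: `split_scheme`, the seed, and the toy data are hypothetical.

```python
import random

def split_scheme(user_ratings, train=0.8, given=4, seed=1):
    """Split users into train/test and hide part of each test user's ratings.

    user_ratings: {user: {item: rating}}. Returns (train_users, known, unknown),
    where `known` holds `given` ratings per test user and `unknown` the rest.
    """
    rng = random.Random(seed)
    users = sorted(user_ratings)
    rng.shuffle(users)
    n_train = int(len(users) * train)
    train_users, test_users = users[:n_train], users[n_train:]
    known, unknown = {}, {}
    for user in test_users:
        items = sorted(user_ratings[user])
        rng.shuffle(items)
        known[user] = {i: user_ratings[user][i] for i in items[:given]}
        unknown[user] = {i: user_ratings[user][i] for i in items[given:]}
    return train_users, known, unknown

# Toy data: 10 hypothetical users who each rated the same 6 items.
toy = {u: {i: 5 + (u + i) % 5 for i in range(6)} for u in range(10)}
train_users, known, unknown = split_scheme(toy)
print(len(train_users), "training users,", len(known), "test users")
print(int(1843 * 0.8))  # users in the reduced matrix times the training proportion
```

The recommender only sees the `known` ratings of a test user and is scored against the `unknown` ones. The last line gives 1474, which agrees with the "learned using 1474 users" messages in the model output.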

3.5.1 Item-Item Collaborative Filtering

Item based recommender
## Warning in .local(x, ...): x was already normalized by row!
## Recommender of type 'IBCF' for 'realRatingMatrix' 
## learned using 1474 users.
Predict

Let’s look at the recommendations for the first few test users.

## $`201`
##  [1]  451  479  893  968  978  982 1624    5   77  124
## 
## $`392`
##  [1]  3  5 10 15 19 22 32 54 56 59
## 
## $`446`
##  [1]  17  21  26  30  62  63  91  99 114 115
## 
## $`661`
##  [1]   28  168  250  366  480  982 1008 1016 1061 1191
## 
## $`771`
##  [1]   1   3  21  22  45  70  79  81  99 123
## 
## $`917`
## integer(0)
## 
## $`1522`
##  [1]   31   71  100  834  965 1072 1237 1277 1380 1387
## 
## $`1530`
##  [1]  807 1005 1039 1064 1094 1167 1284 1384   25   29

Notice that for some users, such as user 917, no items were recommended. This is the cold start problem: the recommender does not have adequate information about a user or an item to make relevant predictions. It happens often with collaborative filtering recommender systems and reduces their performance. In general, the profile of a new user or item is empty or nearly empty because no ratings exist yet, so their taste is not known to the system. Here, the evaluation scheme reveals only 4 ratings per test user.
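To make the empty result concrete, here is a simplified Python sketch of item-based scoring. It is an assumption-laden illustration rather than the R package's implementation: `recommend_ibcf`, the `neighbours` structure, and all numbers are hypothetical.

```python
def recommend_ibcf(known, neighbours, n=10):
    """Rank unseen items using the neighbours of the items a user has rated.

    known: {item: rating} for one user.
    neighbours: {item: {similar_item: similarity}}, a pruned item-item model.
    """
    num, den = {}, {}
    for item, rating in known.items():
        for other, sim in neighbours.get(item, {}).items():
            if other in known:
                continue  # never recommend something the user already rated
            num[other] = num.get(other, 0.0) + sim * rating
            den[other] = den.get(other, 0.0) + abs(sim)
    scores = {i: num[i] / den[i] for i in num if den[i] > 0}
    return sorted(scores, key=lambda i: (-scores[i], i))[:n]

# Hypothetical pruned model: only items A and B have stored neighbours.
neighbours = {"A": {"C": 0.8, "D": 0.2}, "B": {"C": 0.2, "D": 0.8}}

with_overlap = recommend_ibcf({"A": 10, "B": 5}, neighbours)
no_overlap = recommend_ibcf({"Z": 7}, neighbours)
print(with_overlap, no_overlap)
```

The second call returns an empty list: the only item the user rated has no stored neighbours, so nothing can be scored. With just a handful of known ratings per test user, this situation is easy to hit.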

Let’s see which titles were actually recommended to some of these users.

## [[1]]
##                       name type
## 1         Tenshi Kinryouku  OVA
## 2            School Rumble   TV
## 3            Ai Yori Aoshi   TV
## 4    Mobile Suit Gundam ZZ   TV
## 5  Mobile Suit Gundam Wing   TV
## 6                  Futakoi   TV
## 7        Tokyo Underground   TV
## 8              Angel Heart   TV
## 9       Grappler Baki (TV)   TV
## 10         Ace wo Nerae! 2  OVA
## 
## [[2]]
##                                      name type
## 1              Hungry Heart: Wild Striker   TV
## 2                               One Piece   TV
## 3                              Texhnolyze   TV
## 4                 Neon Genesis Evangelion   TV
## 5                           D.C.: Da Capo   TV
## 6                                   DearS   TV
## 7  Mobile Suit Gundam Wing: Endless Waltz  OVA
## 8                               Mai-Otome   TV
## 9             Sakigake!! Cromartie Koukou   TV
## 10       El Hazard: The Alternative World   TV
## 
## [[3]]
##                                             name  type
## 1                Cowboy Bebop: Tengoku no Tobira Movie
## 2                                   Eyeshield 21    TV
## 3                                        Monster    TV
## 4                               Prince of Tennis    TV
## 5 Neon Genesis Evangelion: The End of Evangelion Movie
## 6                              Appleseed (Movie) Movie
## 7                                        Avenger    TV
## 8                                        Chobits    TV
## 
## [[4]]
##                              name type
## 1     Mahou Shoujo Lyrical Nanoha   TV
## 2               Shakugan no Shana   TV
## 3                        Burn Up!  OVA
## 4             Street Fighter II V   TV
## 5             Ginga Densetsu Weed   TV
## 6 The Third: Aoi Hitomi no Shoujo   TV
## 7                   Tokyo Babylon  OVA
## 8                          Blame!  ONA
## 9                    Melty Lancer  OVA

3.5.2 Singular Value Decomposition

## Warning in .local(x, ...): x was already normalized by row!
## Recommender of type 'SVD' for 'realRatingMatrix' 
## learned using 1474 users.
Predict

Let’s see what SVD recommends.

## $`201`
##  [1]  689  343  787  536  660  742 1149  728  732  646
## 
## $`392`
##  [1] 467   9 458 761  70 335 159 482 590 133
## 
## $`446`
##  [1] 482 590  41 467 558 146 574  22  79 154
## 
## $`661`
##  [1]  208  759 1024 1415  570  973  495  490  852  135
## 
## $`771`
##  [1]  467  590  482  852  869  698   70 1030 1041 1102
## 
## $`917`
##  [1]  852  482  590  698  759  551 1158 1102 1115  185
## 
## $`1522`
##  [1] 1431 1359  590 1158 1262 1186 1326 1233 1230 1268
## 
## $`1530`
##  [1]  590 1262  482  996 1230  852 1268 1233 1166 1179

Unlike the item-based recommender, the SVD algorithm provided recommendations for every user. SVD is a commonly used method for estimating missing data in a matrix, and since a recommender system is essentially trying to estimate users’ missing ratings, its use here makes sense. Compared with IBCF, only a few recommendations are the same; for example, item 70 appears in both lists for user 771.
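For reference, the idea behind SVD-based rating estimation can be written as follows. This is the standard truncated-SVD formulation, not something taken from the report: the symbols (R for the row-centered rating matrix, k for the number of retained factors, \bar{r}_u for the mean rating of user u) are assumptions, and missing entries of R have to be filled in first, for example with column means.

```latex
R \approx U_k \Sigma_k V_k^{\top},
\qquad
\hat{r}_{ui} = \bar{r}_u + \sum_{f=1}^{k} U_{uf}\, \sigma_f\, V_{if}
```

Because the low-rank product is defined for every user-item pair, every user receives a score for every item, which is consistent with SVD returning a full list even for the user who got nothing from the item-based recommender.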

Now let’s look at which titles these item numbers correspond to.

## Warning: Column `guess`/`anime_id` joining factors with different levels,
## coercing to character vector

## [[1]]
##                                            name  type
## 1                Hanbun no Tsuki ga Noboru Sora    TV
## 2                                        Naruto    TV
## 3                       Musekinin Kanchou Tylor    TV
## 4                               Mousou Dairinin    TV
## 5 Kidou Senkan Nadesico: The Prince of Darkness Movie
## 6           Mousou Kagaku Series: Wandaba Style    TV
## 7                          Boukyaku no Senritsu    TV
## 
## [[2]]
##                                      name type
## 1             Yu☆Gi☆Oh!: Duel Monsters GX   TV
## 2                       Kage kara Mamoru!   TV
## 3 Ghost in the Shell: Stand Alone Complex   TV
## 4                                Major S2   TV
## 5       Kono Minikuku mo Utsukushii Sekai   TV
## 6                         Rean no Tsubasa  ONA
## 7                        Prince of Tennis   TV
## 8                                Shuffle!   TV
## 9                             Shaman King   TV
## 
## [[3]]
##                                        name    type
## 1   Ghost in the Shell: Stand Alone Complex      TV
## 2                             Buttobi!! CPU     OVA
## 3 Naruto: Akaki Yotsuba no Clover wo Sagase Special
## 4                    Matantei Loki Ragnarok      TV
## 5                      Boukyaku no Senritsu      TV
## 6               Yu☆Gi☆Oh!: Duel Monsters GX      TV
## 7                         Kage kara Mamoru!      TV
## 8                               Green Green      TV
## 
## [[4]]
##                                    name type
## 1                       Gunslinger Girl   TV
## 2 Geobreeders: File-X Chibi Neko Dakkan  OVA
## 3                   Lemon Angel Project   TV
## 4           Yu☆Gi☆Oh!: Duel Monsters GX   TV
## 5                            Burn Up! W  OVA
## 6               Macross Flash Back 2012  OVA
## 7                     Kage kara Mamoru!   TV
## 8                 Mujin Wakusei Survive   TV

3.5.3 Hybrid Recommender

The hybrid recommender combines item-item collaborative filtering with components that account for what the user previously liked, diversity, and popular options.
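One simple way to combine several recommenders is a weighted average of their per-item scores, sketched below in Python. This is only an illustration: the report does not state the weights or the combination rule it used, and `hybrid_scores` and the toy scores are hypothetical.

```python
def hybrid_scores(score_lists, weights):
    """Weighted average of per-item scores from several recommenders.

    An item missing from one recommender is averaged over the recommenders
    that did score it, so one empty component does not empty the result.
    """
    if len(score_lists) != len(weights):
        raise ValueError("need one weight per recommender")
    total, weight_sum = {}, {}
    for scores, w in zip(score_lists, weights):
        for item, s in scores.items():
            total[item] = total.get(item, 0.0) + w * s
            weight_sum[item] = weight_sum.get(item, 0.0) + w
    return {i: total[i] / weight_sum[i] for i in total if weight_sum[i] > 0}

# Hypothetical scores: the item-based component returns nothing for this user.
item_based = {}
popular = {"x": 8.0, "y": 6.0}
fallback = hybrid_scores([item_based, popular], weights=[0.5, 0.5])

# When both components score an item, their scores are blended.
blended = hybrid_scores([{"x": 4.0}, popular], weights=[0.5, 0.5])
print(fallback, blended)
```

The first call shows why a hybrid can still produce a list when the item-based component has nothing to offer: the popularity component fills the gap.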

## Warning in .local(x, ...): x was already normalized by row!

## Recommender of type 'HYBRID' for 'ratingMatrix' 
## learned using NA users.
## $`201`
##  [1]  451 1624  479  893  968  982  798  677  349  978
## 
## $`392`
##  [1]  378   19 1125 1423  154 1161  754    5  823  458
## 
## $`446`
##  [1]  823  555  588   21  773 1080  567  209 1623  671
## 
## $`661`
##  [1]  982 1399  168 1421  996 1717 1481 1536  627 1474
## 
## $`771`
##  [1]  154   99   21  996  214 1290 1024  529 1523    1
## 
## $`917`
##  [1] 1125  761  810  482  555 1290  773 1029   45 1185
## 
## $`1522`
##  [1] 1583  965   71 1072  100  834 1387 1656 1446 1380
## 
## $`1530`
##  [1]  807 1284 1167 1064 1005 1039 1094 1384  378 1038


Some of the items recommended by IBCF and SVD appear again in the hybrid recommender’s lists.

Let’s see the actual items recommended.

## Warning: Column `guess`/`anime_id` joining factors with different levels,
## coercing to character vector

## [[1]]
##                               name  type
## 1    Odin: Koushi Hansen Starlight Movie
## 2          Kyattou Ninden Teyandee    TV
## 3                  Ace wo Nerae! 2   OVA
## 4                    School Rumble    TV
## 5                     Genma Taisen Movie
## 6                 Green Legend Ran   OVA
## 7            Gift: Eternal Rainbow    TV
## 8               Aria The Animation    TV
## 9 Harlock Saga: Nibelung no Yubiwa   OVA
## 
## [[2]]
##                        name type
## 1              Virgin Night  OVA
## 2  Koutetsu Tenshi Kurumi 2   TV
## 3       Itsudatte My Santa!  OVA
## 4                 One Piece   TV
## 5        Tenamonya Voyagers  OVA
## 6    Sentou Yousei Yukikaze  OVA
## 7                 The Big O   TV
## 8              R.O.D the TV   TV
## 9               G-On Riders   TV
## 10      Lemon Angel Project   TV
## 
## [[3]]
##                                   name    type
## 1                  eX-Driver the Movie   Movie
## 2                              Monster      TV
## 3 Lupin III: Napoleon no Jisho wo Ubae Special
## 4                          Shaman King      TV
## 5            Maze☆Bakunetsu Jikuu (TV)      TV
## 6                         Yuki no Joou      TV
## 7      Cowboy Bebop: Tengoku no Tobira   Movie
## 8                         Virgin Night     OVA
## 9                        Buttobi!! CPU     OVA
## 
## [[4]]
##                                      name  type
## 1                            Melty Lancer   OVA
## 2                  Katekyo Hitman Reborn!    TV
## 3                 Macross Flash Back 2012   OVA
## 4                                  Blame!   ONA
## 5         The Third: Aoi Hitomi no Shoujo    TV
## 6          Lupin III: Ikiteita Majutsushi   OVA
## 7 Haru no Ashioto The Movie: Ourin Dakkan Movie
## 8                     Ginga Densetsu Weed    TV
## 9                     Street Fighter II V    TV

3.6 Evaluation

ITEM

SVD

HYBRID

TopN

TopN Accuracy

Recommender      TP         FP        FN        TN        precision  recall     TPR        FPR
anime_item_acc1  0.3035230  7.967480  114.7751  1592.954  0.0385451  0.0035823  0.0035823  0.0047436
anime_svd_acc1   1.4336043  8.566396  113.6450  1592.355  0.1433604  0.0156471  0.0156471  0.0051708
anime_hy_acc1    0.9403794  9.059621  114.1382  1591.862  0.0940379  0.0053911  0.0053911  0.0060231
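The relationship between the confusion-matrix counts and the derived columns can be sketched as follows (Python for illustration; `topn_metrics` is a hypothetical helper, and the input numbers are the averaged counts from the anime_svd_acc1 row).

```python
def topn_metrics(tp, fp, fn, tn):
    """Precision, recall (= TPR) and FPR from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return {"precision": precision, "recall": recall, "TPR": recall, "FPR": fpr}

# Averaged counts from the anime_svd_acc1 row.
svd = topn_metrics(tp=1.4336043, fp=8.566396, fn=113.6450, tn=1592.355)
print(round(svd["precision"], 7))
print(round(svd["recall"], 4), round(svd["FPR"], 4))
```

Precision computed this way reproduces the reported 0.1433604. Recall and FPR come out at about 0.0125 and 0.0054, which do not match the reported values exactly, presumably because the reported values are averages of per-user ratios rather than ratios of averaged counts.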

Ratings

Ratings Accuracy

Recommender      RMSE      MSE       MAE
anime_item_acc2  1.596660  2.549324  1.107453
anime_svd_acc2   1.688743  2.851854  1.235027
anime_hy_acc2    1.583035  2.506001  1.124243

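For all three error measures in this table, lower values indicate better performance. The hybrid recommender has the lowest RMSE (1.583) and MSE, while the item-based recommender has the lowest MAE (1.107).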
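The three error measures are related as in this short Python sketch (illustrative only; `rating_errors` and the toy ratings are hypothetical).

```python
import math

def rating_errors(actual, predicted):
    """RMSE, MSE and MAE between held-out ratings and predicted ratings."""
    errors = [p - a for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return {"RMSE": math.sqrt(mse), "MSE": mse, "MAE": mae}

# Toy example: four held-out ratings and the model's predictions for them.
result = rating_errors(actual=[8, 6, 9, 7], predicted=[7, 6, 10, 9])
print(result)

# Consistency check on the reported item-based row: RMSE squared equals MSE.
print(round(1.596660 ** 2, 4))
```

RMSE penalizes large individual errors more heavily than MAE does, which is why a model can rank first on one measure and second on the other.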

Comparing Models
## IBCF run fold/sample [model time/prediction time]
##   1
## Warning in .local(x, ...): x was already normalized by row!
## [60.31sec/0.15sec] 
##   2
## Warning in .local(x, ...): x was already normalized by row!
## [57.41sec/0.17sec] 
##   3
## Warning in .local(x, ...): x was already normalized by row!
## [58.83sec/0.19sec] 
##   4
## Warning in .local(x, ...): x was already normalized by row!
## [56.03sec/0.2sec] 
## SVD run fold/sample [model time/prediction time]
##   1
## Warning in .local(x, ...): x was already normalized by row!
## [0.94sec/0.56sec] 
##   2
## Warning in .local(x, ...): x was already normalized by row!
## [1.01sec/0.54sec] 
##   3
## Warning in .local(x, ...): x was already normalized by row!
## [0.92sec/0.6sec] 
##   4
## Warning in .local(x, ...): x was already normalized by row!
## [1.15sec/0.58sec] 
## POPULAR run fold/sample [model time/prediction time]
##   1
## Warning in .local(x, ...): x was already normalized by row!
## [0.02sec/2.43sec] 
##   2
## Warning in .local(x, ...): x was already normalized by row!
## [0.03sec/2.14sec] 
##   3
## Warning in .local(x, ...): x was already normalized by row!
## [0.03sec/2sec] 
##   4
## Warning in .local(x, ...): x was already normalized by row!
## [0.03sec/2.06sec] 
## RANDOM run fold/sample [model time/prediction time]
##   1  [0.01sec/0.64sec] 
##   2  [0sec/0.7sec] 
##   3  [0.01sec/0.61sec] 
##   4  [0.01sec/0.63sec]

ROC Curve


The closer the curve is to the top left (high true positive rate at a low false positive rate), the better the performance.

Precision-Recall


The closer the curve is to the top right (high precision at high recall), the better the performance. In this case, the Singular Value Decomposition algorithm performed best.

4. Conclusion

Overall, the hybrid recommender performed best on rating prediction because it has the lowest RMSE. This was expected: in a hybrid recommender, the algorithms make up for each other’s shortcomings. As mentioned earlier, the item-based recommender had trouble recommending items to some new users. This is a known problem for collaborative filtering recommenders: when users have rated only a few of the items available in the database, there is too little information to locate good neighbors, and the resulting recommendations are weak.

To conclude, recommender systems open new opportunities for retrieving personalized information on the web. They also help to alleviate information overload, a very common problem in information retrieval systems, and they give users access to products and services that are not otherwise readily available to them. This project discussed three recommendation techniques and highlighted their strengths and weaknesses. Evaluation metrics were used to measure the quality and performance of the models these algorithms produced.

5. References