
Introduction

MyAnimeList (MAL) is the world’s largest anime and manga database and community, where users can organize the anime they follow into personal lists. Once a title on a list has been watched, the user can give it a rating. These ratings make it possible to find users with similar tastes. This project first explores the dataset to gain insights, then builds recommender systems (item-item collaborative filtering, singular value decomposition, and a hybrid of several methods) to recommend anime and predict ratings for users. The recommenders are then evaluated to see how well they perform.

The anime data and user ratings come from the myanimelist.net API. The data was obtained from Kaggle Datasets and contains preference data from 73,516 users on 12,294 anime. Each user can add anime to their completed list and rate it; this dataset is a compilation of those ratings. Ratings range from 1 to 10, with 10 being the best. A rating of -1 means the user did not provide a rating for that item.

The initial data looked like the dataframe below.

anime_id	name	genre	type	episodes	rating	members
32281	Kimi no Na wa.	Drama, Romance, School, Supernatural	Movie	1	9.37	200630
5114	Fullmetal Alchemist: Brotherhood	Action, Adventure, Drama, Fantasy, Magic, Military, Shounen	TV	64	9.26	793665
28977	Gintama°	Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen	TV	51	9.25	114262
9253	Steins;Gate	Sci-Fi, Thriller	TV	24	9.17	673572
9969	Gintama'	Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen	TV	51	9.16	151266
32935	Haikyuu!!: Karasuno Koukou VS Shiratorizawa Gakuen Koukou	Comedy, Drama, School, Shounen, Sports	TV	10	9.15	93351

The following is an explanation of the features contained in the entire dataset.

Anime.csv

anime_id - myanimelist.net’s unique id identifying an anime.
name - full name of anime.
genre - comma separated list of genres for this anime.
type - movie, TV, OVA, etc.
episodes - how many episodes in this show. (1 if movie).
rating - average rating out of 10 for this anime.
members - number of community members that are in this anime’s “group”.

user_id	anime_id	rating
1	20	-1
1	24	-1
1	79	-1
1	226	-1

Rating.csv

user_id - non identifiable randomly generated user id.
anime_id - the anime that this user has rated.
rating - rating out of 10 this user has assigned (-1 if the user watched it but didn’t assign a rating).

According to the description that accompanies the data, ratings range from 1 to 10, and an item that a user watched but did not rate is recorded as -1. For simplicity, I will change -1 to NA to indicate that the rating is missing. I will also change the data type of some variables.

Data Preparation

Once the dataset has been collected, the next step is preprocessing. At this stage we perform data wrangling: transforming the data into a tidy form that is ready to be analyzed.

Missing data can be a non-trivial problem when analysing a dataset, and accounting for it is usually not straightforward either. For the anime dataframe, however, we can simply remove the rows with missing values:

type : 25 observations
rating : 205 observations

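library(dplyr)  # %>% and the data verbs used below
library(tidyr)  # drop_na()

# Drop rows whose type is an empty string, then discard the unused level
# (droplevels() assumes type is stored as a factor).
anime <- anime[anime$type != "",]
anime$type <- droplevels(anime$type)
# Drop rows with a missing rating.
anime <- anime %>% 
  drop_na(rating)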
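
The counts listed above can be reproduced with a check along the following lines. This is a minimal sketch on a toy data frame: `anime_toy` and its values are invented for illustration, and it assumes (as the filtering code suggests) that a missing type is stored as an empty string and a missing rating as NA.

```r
# Toy stand-in for the anime data frame (values are made up).
anime_toy <- data.frame(
  anime_id = 1:5,
  type     = c("TV", "", "Movie", "OVA", ""),
  rating   = c(8.1, 7.4, NA, 6.9, NA),
  stringsAsFactors = FALSE
)

# Count the missing values in each column of interest.
missing_counts <- c(
  type   = sum(anime_toy$type == ""),
  rating = sum(is.na(anime_toy$rating))
)
print(missing_counts)

# Keep only rows where both fields are present.
anime_clean <- anime_toy[anime_toy$type != "" & !is.na(anime_toy$rating), ]
print(nrow(anime_clean))
```

Counting before dropping makes it easy to confirm that only a small share of rows is lost.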

After inspecting the data, we apply the following type conversions to the dataset:

anime <- anime %>% 
  mutate(
   anime_id = as.factor(anime_id),
   name = as.character(name),
   genre = as.character(genre),
   episodes =  as.numeric(as.character(episodes))
  )


ratings <- ratings %>% 
  mutate(
    user_id = as.factor(user_id),
    anime_id = as.factor(anime_id)
  )

Next we convert -1 in the rating column of the ratings dataframe to NA. There are many ways to handle missing data, such as imputation, which replaces the missing values with an estimate and then analyzes the full dataset as if the imputed values were actually observed. In this case, I will leave these values as NA.

ratings$rating[ratings$rating == -1] <- NA

Data Exploration

Before modeling, we carry out Exploratory Data Analysis (EDA). Here we examine how the dataset is distributed across its features/variables. The following is the EDA generated from rating.csv data.

Rating Distribution

Ratings across all anime are approximately normally distributed, with an average rating of 6.473902.

Type Distribution

Next we look at how anime are distributed across types, both by the number of anime in each type and by the average rating of the anime in that type.
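
A per-type summary of this kind can be computed roughly as follows. This is a sketch on invented data; `anime_toy` and `type_summary` are illustrative names, not objects from the original analysis.

```r
# Toy stand-in for the anime data frame (values invented for illustration).
anime_toy <- data.frame(
  type   = c("TV", "TV", "Movie", "OVA", "Movie", "TV"),
  rating = c(7.5, 6.5, 8.0, 6.0, 7.0, 8.5),
  stringsAsFactors = FALSE
)

# Number of titles and mean rating for each type.
n_by_type   <- tapply(anime_toy$rating, anime_toy$type, length)
avg_by_type <- tapply(anime_toy$rating, anime_toy$type, mean)

type_summary <- data.frame(
  type       = names(n_by_type),
  n          = as.vector(n_by_type),
  avg_rating = as.vector(avg_by_type)
)

# Most common type first.
print(type_summary[order(-type_summary$n), ])
```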

Genre Distribution

#> As we can see, they made 3230 different combinations. We can split them into a single genre list, that would be good for our further analysis

#> [1] "Types of Genre -> "

#>  [1] "Comedy"        "Action"        "Fantasy"       "Sci-Fi"       
#>  [5] "Drama"         "Shounen"       "Kids"          "Adventure"    
#>  [9] "Romance"       "SliceofLife"   "School"        "Hentai"       
#> [13] "Supernatural"  "Mecha"         "Music"         "Historical"   
#> [17] "Magic"         "Ecchi"         "Shoujo"        "Sports"       
#> [21] "Seinen"        "Mystery"       "SuperPower"    "Military"     
#> [25] "Parody"        "Space"         "Horror"        "Harem"        
#> [29] "Demons"        "MartialArts"   "Dementia"      "Psychological"
#> [33] "Police"        "Game"          "Samurai"       "Vampire"      
#> [37] "Thriller"      "Cars"          "ShounenAi"     "ShoujoAi"     
#> [41] "Josei"         "Yuri"          "Yaoi"
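
The code that split the genre combinations is not shown. One way to do it in base R is sketched below on a few invented genre strings; removing the spaces first reproduces labels such as "SliceofLife" in the output above.

```r
# Hypothetical genre strings in the same comma-separated format as anime$genre.
genre_toy <- c(
  "Drama, Romance, School, Supernatural",
  "Sci-Fi, Thriller",
  "Comedy, Slice of Life, School"
)

# Number of distinct genre combinations (whole strings).
n_combinations <- length(unique(genre_toy))

# Drop spaces so "Slice of Life" becomes "SliceofLife", then split on commas.
genre_list    <- strsplit(gsub(" ", "", genre_toy), ",")
single_genres <- unique(unlist(genre_list))
print(single_genres)

# How many titles carry each individual genre.
genre_counts <- sort(table(unlist(genre_list)), decreasing = TRUE)
print(genre_counts)
```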

User-Item Matrix

Next, we transform the data into a realRatingMatrix to build the recommendation engine. Before that, we filter the data for computational reasons, cutting the matrix down so that it contains only users who rated at least 500 anime and anime that were rated at least 1000 times.

# Users who rated at least 500 anime (1853 users)
user_filter <- ratings %>% 
  group_by(user_id) %>% 
  summarise(n=n()) %>% 
  filter(n>=500)

# Anime that were rated at least 1000 times (1721 anime)
anime_filter <- ratings %>% 
  group_by(anime_id) %>% 
  summarise(n=n()) %>% 
  filter(n>=1000)

# Keep only ratings where both the user and the anime pass the filters
ratings_filter <- ratings %>% 
  filter(user_id %in% user_filter$user_id,
         anime_id %in% anime_filter$anime_id)

# Coerce the (user, item, rating) data frame to a recommenderlab rating matrix
library(recommenderlab)
anime_matrix <- as(ratings_filter, "realRatingMatrix")

#> 1853 x 1721 rating matrix of class 'realRatingMatrix' with 971835 ratings.
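
The printed dimensions also tell us how densely the filtered matrix is filled. A small calculation using only the three numbers from the output above:

```r
n_users   <- 1853
n_items   <- 1721
n_ratings <- 971835

# Fraction of user-item cells that actually hold a rating.
density <- n_ratings / (n_users * n_items)
print(round(density, 3))
```

Roughly three in ten cells are filled, which is dense for a rating matrix and follows from keeping only very active users and frequently rated anime.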

From this anime matrix, we can also observe the distribution of the ratings that users have given.

Judging by the ratings these users gave, most shows were well received: the majority of ratings are 8 or higher.

Normalization is commonly used as a basic preprocessing step for predictor models. Centering each user's ratings removes individual rating bias, which can improve recommendation performance.

anime_matrix <- normalize(anime_matrix)
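
To make the effect of row centering concrete, here is a base R sketch of the same idea on a tiny invented matrix. It illustrates the arithmetic only and is not the package's internal implementation; `ratings_toy` is a hypothetical name.

```r
# Toy user-by-item rating matrix; NA marks an unrated item (values invented).
ratings_toy <- matrix(
  c( 8, 10, NA,  9,
     4, NA,  6,  5,
    NA,  7,  7, 10),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("u1", "u2", "u3"), c("a1", "a2", "a3", "a4"))
)

# Each user's mean over the items they actually rated.
user_means <- rowMeans(ratings_toy, na.rm = TRUE)

# Subtract the user's mean from each of their ratings; NAs stay NA.
centered <- sweep(ratings_toy, 1, user_means, FUN = "-")
print(user_means)
print(centered)
```

After centering, a generous rater and a strict rater who rank shows in the same order end up with similar rows.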

Recommender Systems

There are basically two approaches to making a recommendation. Let’s say we want to recommend a set of additional products to a customer who purchased product X:

We can try to find out what made product X attractive to the customer and suggest other products that share that quality. These are called content-based recommender systems.
We can look at all other users who also purchased product X, list the other products they purchased, and recommend the products that appear most often in that list. These are called collaborative filtering recommender systems.

Recommender systems aim to predict users’ interests and recommend items that are likely to interest them. They help users make decisions by discovering new and relevant items. In the following sections we look at how three recommenders work: item-based collaborative filtering, singular value decomposition, and a hybrid recommender.

Cross Validation

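First we divide the data into training and test sets so that the recommender algorithms can learn from the training data and then try to predict relevant outcomes for the test users. The scheme below uses 4 random splits with 80% of users for training; for each test user, 4 ratings are given to the recommender and the rest are held out.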

#> Evaluation scheme with 4 items given
#> Method: 'split' with 4 run(s).
#> Training set proportion: 0.800
#> Good ratings: >=5.000000
#> Data set: 1853 x 1721 rating matrix of class 'realRatingMatrix' with 971835 ratings.
#> Normalized using center on rows.
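
The call that produced this scheme is not shown. The sketch below uses settings consistent with the printed summary (method "split", 4 runs, 80% training, 4 given items, good rating >= 5), but runs on a small random matrix rather than the anime data. `toy` and `toy_eval` are illustrative names.

```r
library(recommenderlab)

set.seed(42)
# Random 60 x 30 rating matrix with roughly half the cells missing (toy data).
vals <- sample(c(NA, 1:10), 60 * 30, replace = TRUE,
               prob = c(0.5, rep(0.05, 10)))
toy <- as(matrix(vals, nrow = 60,
                 dimnames = list(paste0("u", 1:60), paste0("a", 1:30))),
          "realRatingMatrix")

# Users need more than `given` ratings, so drop any that fall short.
toy <- toy[rowCounts(toy) > 4, ]

toy_eval <- evaluationScheme(toy, method = "split", train = 0.8,
                             k = 4, given = 4, goodRating = 5)

dim(getData(toy_eval, "train"))    # users used to fit a model
dim(getData(toy_eval, "known"))    # test users, 4 revealed ratings each
dim(getData(toy_eval, "unknown"))  # test users, ratings held out for scoring
```

The "known" part is what a recommender sees for a test user at prediction time, and the "unknown" part is what its output is scored against.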

Item-Item Collaborative Filtering

We are going to create a model called IBCF, or Item-Based Collaborative Filtering. IBCF recommends items based on the similarity between items, computed from how users have rated (consumed) them.
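
The idea can be sketched in base R on a tiny invented matrix: compute a similarity between every pair of items, then predict a user's rating for an unseen item as a similarity-weighted mean of that user's own ratings. This is a simplified illustration, not the package's implementation, which typically also normalizes the data and keeps only the most similar items. The names `cosine_items` and `predict_rating` are hypothetical.

```r
# Toy user-by-item matrix; NA = not rated (values invented).
r <- matrix(
  c( 9,  8, NA,  2,
     8,  9,  3, NA,
     2, NA,  9,  8,
    NA,  3,  8,  9),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("u", 1:4), paste0("a", 1:4))
)

# Cosine similarity between two item columns over users who rated both.
cosine_items <- function(x, y) {
  both <- !is.na(x) & !is.na(y)
  if (!any(both)) return(NA_real_)
  sum(x[both] * y[both]) / sqrt(sum(x[both]^2) * sum(y[both]^2))
}

# Item-item similarity matrix (diagonal left as NA).
n_items <- ncol(r)
sim <- matrix(NA_real_, n_items, n_items,
              dimnames = list(colnames(r), colnames(r)))
for (i in seq_len(n_items)) for (j in seq_len(n_items)) {
  if (i != j) sim[i, j] <- cosine_items(r[, i], r[, j])
}

# Predict a rating as the similarity-weighted mean of the user's other ratings.
predict_rating <- function(user, item) {
  rated <- setdiff(names(which(!is.na(r[user, ]))), item)
  w  <- sim[item, rated]
  ok <- !is.na(w)
  sum(w[ok] * r[user, rated][ok]) / sum(abs(w[ok]))
}

print(round(sim, 2))
print(predict_rating("u1", "a3"))
```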

#> [1] 1482 1721

#> [1]  371 1721

anime_item_recc <- Recommender(data = getData(anime_eval, "train"), method = "IBCF")

#> Recommender of type 'IBCF' for 'realRatingMatrix' 
#> learned using 1482 users.

PREDICT

anime_pred <- predict(object = anime_item_recc, newdata = getData(anime_eval, "known"), n = 10)

anime_predr <- predict(object = anime_item_recc, newdata = getData(anime_eval, "known"), type = "ratings")

Let’s look at the recommendations for the first 8 users.

#> $`226`
#>  [1] 1041 1125 1254 1264 1323 1325 1329 1336 1410 1482
#> 
#> $`1019`
#>  [1] 134 159 335 621 724 745 856 868 915 918
#> 
#> $`1504`
#>  [1]   7  19  24  32  42  67  91 111 112 113
#> 
#> $`1522`
#>  [1] 578 794 835 889 925 948 958 965 976 995
#> 
#> $`1530`
#>  [1]  55  70  71  79  81 105 134 159 482 560
#> 
#> $`1984`
#>  [1] 122 261 262 416 417 418 419 420 463 476
#> 
#> $`2273`
#>  [1]  935 1175 1176 1402 1562 1579 1675 1686    7   23
#> 
#> $`2297`
#>  [1]  725  760  769  843  846  863  957  976 1180 1301

Notice that for some users, no items were recommended. This is the cold start problem: the recommender does not have adequate information about a user or an item to make relevant predictions. It happens often with collaborative filtering recommender systems and reduces their performance. The profile of a new user or item is nearly empty because few or no ratings exist for it, so the user's taste is not known to the system. In this evaluation each test user reveals only 4 ratings to the model, which mimics that situation.

Let’s see which titles were actually recommended to some of these users.

#> [[1]]
#>                        name    type
#> 1        Gokinjo Monogatari      TV
#> 2            Virtua Fighter      TV
#> 3          Crayon Shin-chan      TV
#> 4          Oruchuban Ebichu      TV
#> 5    Sentou Yousei Yukikaze     OVA
#> 6 Seikai no Danshou: Tanjou Special
#> 7  Hikari to Mizu no Daphne      TV
#> 
#> [[2]]
#>                                             name  type
#> 1                             Witch Hunter Robin    TV
#> 2                                        Monster    TV
#> 3                                  School Rumble    TV
#> 4 Neon Genesis Evangelion: The End of Evangelion Movie
#> 5                    Basilisk: Kouga Ninpou Chou    TV
#> 6         Mobile Suit Gundam Wing: Endless Waltz   OVA
#> 7                                  Corrector Yui    TV
#> 8                        Chou Henshin Cosprayers    TV
#> 9                              Uchuu no Stellvia    TV
#> 
#> [[3]]
#>                               name  type
#> 1                  Gunslinger Girl    TV
#> 2             Boukyaku no Senritsu    TV
#> 3           Matantei Loki Ragnarok    TV
#> 4 Night Walker: Mayonaka no Tantei    TV
#> 5                            Enzai   OVA
#> 6                    Utawarerumono    TV
#> 7                    Slayers Great Movie
#> 8              Ginga Densetsu Weed    TV
#> 9                          Gintama    TV
#> 
#> [[4]]
#> [1] name type
#> <0 rows> (or 0-length row.names)
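
The lookup code is not shown. The sketch below is one way to map the integers in a top-N list back to titles, under the assumption that those integers are column positions in the rating matrix rather than anime_id values. All object names and values are hypothetical.

```r
# Hypothetical column labels of the rating matrix: each position holds an anime_id.
item_labels <- c("20", "24", "79", "226", "5114")

# Hypothetical lookup table in the shape of Anime.csv.
anime_lookup <- data.frame(
  anime_id = c(5114, 226, 79, 24, 20),
  name     = c("Title E", "Title D", "Title C", "Title B", "Title A"),
  type     = c("TV", "TV", "TV", "TV", "Movie"),
  stringsAsFactors = FALSE
)

# Column positions recommended to one user.
recommended_positions <- c(5, 1, 3)

# Step 1: position -> anime_id via the matrix's column labels.
recommended_ids <- as.numeric(item_labels[recommended_positions])

# Step 2: anime_id -> row of the lookup table, keeping recommendation order.
rows <- match(recommended_ids, anime_lookup$anime_id)
recommended_titles <- anime_lookup[rows, c("name", "type")]
print(recommended_titles)
```

Going through the column labels matters: matching a column position directly against anime_id would return a different title, or none at all when no anime has that id.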

Singular Value Decomposition

anime_SVD_recc <- Recommender(data = getData(anime_eval, "train"), method = "SVD")

#> Recommender of type 'SVD' for 'realRatingMatrix' 
#> learned using 1482 users.

PREDICT

anime_svd_pred <- predict(object = anime_SVD_recc, newdata = getData(anime_eval, "known"), n = 10) 

anime_svd_predr <- predict(object = anime_SVD_recc, newdata = getData(anime_eval, "known"), type = "ratings")

Let’s see what SVD recommends for the first 8 users.

#> $`226`
#>  [1] 1286 1405 1220 1315 1160 1330 1306 1360 1296 1254
#> 
#> $`1019`
#>  [1] 1159 1432 1636 1453 1502 1506 1677 1508 1656 1646
#> 
#> $`1504`
#>  [1] 1064 1552 1665  961 1685 1114 1650 1700  937  914
#> 
#> $`1522`
#>  [1] 1162  536 1528 1263 1183  487  343 1031 1225 1348
#> 
#> $`1530`
#>  [1]  591  482 1159  853  762 1031  512 1103  487  458
#> 
#> $`1984`
#>  [1]  208  591  797  482  551  184  632 1416  341  322
#> 
#> $`2273`
#>  [1] 1709 1634 1039 1718  378 1288 1556  628 1624 1291
#> 
#> $`2297`
#>  [1]  591 1263 1269  482  760 1156 1354 1229 1231  997

Unlike the item-based recommender, the SVD algorithm provided recommendations for every user. In general, SVD is a commonly used method for estimating missing data in a data matrix. Since recommender systems are essentially trying to estimate users' missing ratings, the use of SVD makes sense. Compared with IBCF, some of the recommended items are the same.
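
How a truncated SVD fills in missing ratings can be sketched with base R's `svd()`. The sketch assumes missing cells are first filled with the item mean, and the rank `k = 2` is an arbitrary choice for this toy example. Because every cell of the low-rank reconstruction has a value, every user receives predictions.

```r
# Toy rating matrix with missing entries (values invented).
r <- matrix(
  c( 9,  8, NA,  2,
     8, NA,  3,  1,
     2,  3,  9, NA,
    NA,  2,  8,  9),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("u", 1:4), paste0("a", 1:4))
)

# Fill each missing cell with its column (item) mean so svd() can run.
col_means <- colMeans(r, na.rm = TRUE)
filled <- r
for (j in seq_len(ncol(r))) filled[is.na(r[, j]), j] <- col_means[j]

# Keep only the k largest singular values (rank-k approximation).
k <- 2
s <- svd(filled)
low_rank <- s$u[, 1:k] %*% diag(s$d[1:k], nrow = k) %*% t(s$v[, 1:k])
dimnames(low_rank) <- dimnames(r)

# Estimated ratings for the cells that were originally missing.
print(round(low_rank[is.na(r)], 2))
```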

Now let’s have a look at what the numbers match to.

#> [[1]]
#>                                 name type
#> 1                        Kachou Ouji   TV
#> 2                       Hyper Police   TV
#> 3                       Variable Geo  OVA
#> 4             Comic Party Revolution   TV
#> 5                       Black Lagoon   TV
#> 6               Ginga Eiyuu Densetsu  OVA
#> 7                              Suika  OVA
#> 8      Yoshinaga-san'chi no Gargoyle   TV
#> 9                          HeatGuy J   TV
#> 
#> [[2]]
#>                                           name  type
#> 1                                  Mazinkaiser   OVA
#> 2 Hiatari Ryoukou! Yume no Naka ni Kimi ga Ita Movie
#> 3                           Babel Nisei (1992)   OVA
#> 4                               Virtua Fighter    TV
#> 5                                 Duel Masters    TV
#> 6                          Uchuu Senkan Yamato    TV
#> 7                                    Appleseed   OVA
#> 8                      Ike! Ina-chuu Takkyuubu    TV
#> 
#> [[3]]
#>                                                     name    type
#> 1 Geobreeders 2: Mouryou Yuugekitai File-XX Ransen Toppa     OVA
#> 2                     Lupin III: Fuuma Ichizoku no Inbou   Movie
#> 3                 Ai Shimai Tsubomi... Kegashite Kudasai     OVA
#> 4                                          Maison Ikkoku      TV
#> 5                       Mahou Shoujo Pretty Sammy (1996)      TV
#> 6    Detective Conan Movie 10: Requiem of the Detectives   Movie
#> 7                              Cosmo Warrior Zero Gaiden Special
#> 8                                           Sci-fi Harry      TV
#> 9                                        PostPet Momobin      TV
#> 
#> [[4]]
#>                                                                   name  type
#> 1                                       Tenjou Tenge: The Past Chapter Movie
#> 2 Tenchi Muyou! Ryououki 3rd Season: Tenchi Seirou naredo Namitakashi?   OVA
#> 3                Bishoujo Senshi Sailor Moon S: Kaguya Hime no Koibito Movie
#> 4                                                          Kachou Ouji    TV
#> 5                                                    Bakuretsu Hunters    TV
#> 6                                                 Hachimitsu to Clover    TV
#> 7                                                 Saishuu Heiki Kanojo    TV
#> 8          RahXephon Interlude: Her and Herself/Thatness and Thereness   OVA
#> 9                                              Spiral: Suiri no Kizuna    TV

Hybrid Recommender

The hybrid recommender combines item-item CF with what the user previously liked (RERECOMMEND), popular options (POPULAR), and random picks for diversity (RANDOM), weighted as shown below.

anime_hybrid_recc <- HybridRecommender(
  Recommender(data = getData(anime_eval, "train"), method = "IBCF"),
  Recommender(data = getData(anime_eval, "train"), method = "POPULAR"),
  Recommender(data = getData(anime_eval, "train"), method = "RERECOMMEND"),
  Recommender(data = getData(anime_eval, "train"), method = "RANDOM"), #diversity
  weights = c(0.5, 0.3, 0.1, 0.1)
)

#> Recommender of type 'HYBRID' for 'ratingMatrix' 
#> learned using NA users.
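
The effect of the weights can be illustrated with a small base R calculation: each component model produces a score per item, and the hybrid score is their weighted mean. The scores below are invented. Renormalizing over the models that produced a score is a choice made for this sketch and not necessarily how `HybridRecommender` treats missing predictions.

```r
# Hypothetical predicted ratings for one user from four component models
# (rows = models, columns = items); NA means the model made no prediction.
scores <- rbind(
  IBCF        = c(a1 = 8.0, a2 = NA,  a3 = 6.0),
  POPULAR     = c(a1 = 7.0, a2 = 9.0, a3 = 5.0),
  RERECOMMEND = c(a1 = NA,  a2 = 8.0, a3 = NA),
  RANDOM      = c(a1 = 3.0, a2 = 6.0, a3 = 9.0)
)
weights <- c(IBCF = 0.5, POPULAR = 0.3, RERECOMMEND = 0.1, RANDOM = 0.1)

# Weighted mean per item, renormalizing over the models that produced a score.
hybrid_score <- apply(scores, 2, function(s) {
  ok <- !is.na(s)
  sum(weights[ok] * s[ok]) / sum(weights[ok])
})
print(round(hybrid_score, 2))
```

With IBCF carrying half of the total weight, its opinion dominates wherever it has one, while the other models fill in where it does not.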

PREDICT

anime_hybrid_pred <- predict(object = anime_hybrid_recc, newdata = getData(anime_eval, "known"), n = 10) 

anime_hybrid_predr <- predict(object = anime_hybrid_recc, newdata = getData(anime_eval, "known"), type = "ratings")

These are the hybrid recommender's suggestions for the first 8 users.

#> $`226`
#>  [1] 1041 1254 1410  762 1336 1482 1125 1039 1325  591
#> 
#> $`1019`
#>  [1] 1550 1379 1508  868 1672 1348 1564 1707 1240 1187
#> 
#> $`1504`
#>  [1]  235  529 1493 1656 1195 1432 1704 1327  548  178
#> 
#> $`1522`
#>  [1]  378 1126  901 1195  559 1263 1291  948  997 1562
#> 
#> $`1530`
#>  [1]  824  591  105 1620  482   70  134 1327 1159   81
#> 
#> $`1984`
#>  [1] 1222  262  419  416  476  261  463  420 1204  999
#> 
#> $`2273`
#>  [1] 1710 1678 1665  710 1656 1275  343 1373 1707 1627
#> 
#> $`2297`
#>  [1]  760  769  843 1180 1301  725 1354 1605  957 1676

Some of the items recommended by IBCF and SVD reappear in the hybrid recommender's lists.

Let’s see the actual items recommended.

#> [[1]]
#>                                     name    type
#> 1                         Virtua Fighter      TV
#> 2              Seikai no Danshou: Tanjou Special
#> 3                     Gokinjo Monogatari      TV
#> 4                 Sentou Yousei Yukikaze     OVA
#> 5                       Crayon Shin-chan      TV
#> 6                       Oruchuban Ebichu      TV
#> 7 One: Kagayaku Kisetsu e - True Stories     OVA
#> 
#> [[2]]
#>                                  name  type
#> 1                     Detective Conan    TV
#> 2                Saishuu Heiki Kanojo    TV
#> 3       Project ARMS: The 2nd Chapter    TV
#> 4                     PostPet Momobin    TV
#> 5                    Zero no Tsukaima    TV
#> 6  Lupin III: Fuuma Ichizoku no Inbou Movie
#> 7                             Eat-Man    TV
#> 8                Aoki Densetsu Shoot!    TV
#> 9                      Wonderful Days Movie
#> 10                       Ultra Maniac    TV
#> 
#> [[3]]
#>                                                                                         name
#> 1                                                                                Attack No.1
#> 2                                       Kino no Tabi: Nanika wo Suru Tame ni - Life Goes On.
#> 3                                                                               Sci-fi Harry
#> 4                                                                              Slayers Great
#> 5                                                               Bomberman B-Daman Bakugaiden
#> 6                                                                            Bubblegum Crash
#> 7                                                                Pokemon Advanced Generation
#> 8                                                                          Kinnikuman II Sei
#> 9  Bishoujo Senshi Sailor Moon SuperS: Sailor 9 Senshi Shuuketsu! Black Dream Hole no Kiseki
#> 10                                                                                  DNA² OVA
#>     type
#> 1     TV
#> 2  Movie
#> 3     TV
#> 4  Movie
#> 5     TV
#> 6    OVA
#> 7     TV
#> 8     TV
#> 9  Movie
#> 10   OVA
#> 
#> [[4]]
#>                                                     name    type
#> 1                                 Beet the Vandel Buster      TV
#> 2                                        Guardian Hearts     OVA
#> 3                                  I: Wish You Were Here      TV
#> 4                            Figure 17: Tsubasa & Hikaru      TV
#> 5                           Ai Shimai: Futari no Kajitsu     OVA
#> 6  Bishoujo Senshi Sailor Moon S: Kaguya Hime no Koibito   Movie
#> 7                           Bleach: Memories in the Rain Special
#> 8                                              One Piece      TV
#> 9                                            Shaman King      TV
#> 10                                         Winter Garden Special

Evaluation

IBCF

anime_item_acc1 <- calcPredictionAccuracy(x = anime_pred, data = getData(anime_eval, "unknown"), given = 4, goodRating = 5)
anime_item_acc2 <- calcPredictionAccuracy(x = anime_predr, data = getData(anime_eval, "unknown"))

SVD

anime_svd_acc1 <- calcPredictionAccuracy(x = anime_svd_pred, data = getData(anime_eval, "unknown"), given = 4, goodRating = 5)
anime_svd_acc2 <- calcPredictionAccuracy(x = anime_svd_predr, data = getData(anime_eval, "unknown"))

HYBRID

anime_hy_acc1 <- calcPredictionAccuracy(x = anime_hybrid_pred, data = getData(anime_eval, "unknown"), given = 4, goodRating = 5)
anime_hy_acc2 <- calcPredictionAccuracy(x = anime_hybrid_predr, data = getData(anime_eval, "unknown"))

Top-N Anime Recommendation

Recommender	TP	FP	FN	TN	precision	recall	TPR	FPR
anime_item_acc1	0.2345013	8.210243	98.33154	1610.224	0.0293375	0.0019839	0.0019839	0.0048906
anime_svd_acc1	1.1967655	8.803235	97.36927	1609.631	0.1196765	0.0101774	0.0101774	0.0053067
anime_hy_acc1	0.7358491	9.264151	97.83019	1609.170	0.0735849	0.0055581	0.0055581	0.0056751
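
The ratio columns are derived from the confusion counts. As a sanity check, the sketch below recomputes them from the averaged counts in the SVD row. Precision matches the table. Recall and FPR come out slightly different, presumably because the package averages per-user ratios rather than taking ratios of the averaged counts.

```r
# Averaged confusion counts for one recommender (SVD row of the table above).
tp <- 1.1967655; fp <- 8.803235; fn <- 97.36927; tn <- 1609.631

precision <- tp / (tp + fp)   # share of recommended items that were good
recall    <- tp / (tp + fn)   # share of good items that were recommended (= TPR)
fpr       <- fp / (fp + tn)   # share of not-good items that were recommended

print(c(precision = precision, recall = recall, FPR = fpr))
```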

Ratings Accuracy

Recommender	RMSE	MSE	MAE
anime_item_acc2	1.467894	2.154712	1.043637
anime_svd_acc2	1.622312	2.631897	1.192865
anime_hy_acc2	1.525050	2.325776	1.093793

In this table, lower numbers mean better performance. IBCF has the lowest RMSE, MSE, and MAE of the three models.
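
For reference, the three error measures are defined as in the short sketch below. The rating vectors are invented, and in the real evaluation the comparison runs over every held-out rating that received a prediction.

```r
# Hypothetical held-out ratings and the corresponding predictions.
actual    <- c(8, 6, 9, 4, 7)
predicted <- c(7, 6, 7, 5, 9)

err  <- predicted - actual
mse  <- mean(err^2)      # mean squared error
rmse <- sqrt(mse)        # root mean squared error, in rating units
mae  <- mean(abs(err))   # mean absolute error

print(c(RMSE = rmse, MSE = mse, MAE = mae))
```

RMSE penalizes large misses more heavily than MAE, which is why the two can rank models differently even though they agree here.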

Model Comparison

#> IBCF run fold/sample [model time/prediction time]
#>   1  [52.68sec/0.17sec] 
#>   2  [52.6sec/0.17sec] 
#>   3  [52.71sec/0.15sec] 
#>   4  [52.29sec/0.18sec] 
#> SVD run fold/sample [model time/prediction time]
#>   1  [1.03sec/0.57sec] 
#>   2  [1.2sec/0.82sec] 
#>   3  [1.09sec/0.81sec] 
#>   4  [1.31sec/0.54sec] 
#> POPULAR run fold/sample [model time/prediction time]
#>   1  [0.01sec/2.03sec] 
#>   2  [0.02sec/2.03sec] 
#>   3  [0.02sec/1.94sec] 
#>   4  [0.01sec/1.91sec] 
#> RANDOM run fold/sample [model time/prediction time]
#>   1  [0.02sec/0.59sec] 
#>   2  [0.01sec/0.57sec] 
#>   3  [0.02sec/0.56sec] 
#>   4  [0.02sec/0.59sec]
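
The timing log above has the shape printed by recommenderlab's `evaluate()` when it is given a list of algorithms. The original call is not shown. The sketch below runs the same four methods on a small random matrix, where the data, the parameter values (`k = 10`, `k = 5`), and the list sizes in `n` are all assumptions made for illustration.

```r
library(recommenderlab)

set.seed(7)
# Random 60 x 30 rating matrix with roughly half the cells missing (toy data).
vals <- sample(c(NA, 1:10), 60 * 30, replace = TRUE,
               prob = c(0.5, rep(0.05, 10)))
toy <- as(matrix(vals, nrow = 60,
                 dimnames = list(paste0("u", 1:60), paste0("a", 1:30))),
          "realRatingMatrix")
toy <- toy[rowCounts(toy) > 4, ]

toy_eval <- evaluationScheme(toy, method = "split", train = 0.8,
                             k = 4, given = 4, goodRating = 5)

# One entry per algorithm: the method name plus optional parameters.
algorithms <- list(
  IBCF    = list(name = "IBCF",    param = list(k = 10)),
  SVD     = list(name = "SVD",     param = list(k = 5)),
  POPULAR = list(name = "POPULAR", param = NULL),
  RANDOM  = list(name = "RANDOM",  param = NULL)
)

# Fit and score every algorithm on every split for several list lengths.
results <- evaluate(toy_eval, algorithms, type = "topNList",
                    n = c(1, 3, 5, 10))

plot(results, annotate = TRUE)               # ROC curve (TPR vs FPR)
plot(results, "prec/rec", annotate = TRUE)   # precision-recall curve
```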

ROC Curve

The closer the curve is to the top left (a high true positive rate at a low false positive rate), the better the performance.

Precision-Recall Curve

The closer the curve is to the top right (high precision together with high recall), the better the performance. In this case, the Singular Value Decomposition algorithm performed best.

Conclusion

Overall, no single recommender was best on every measure. IBCF had the lowest rating-prediction error (RMSE 1.47, against 1.53 for the hybrid and 1.62 for SVD), while SVD had the highest top-N precision and recall. The hybrid recommender fell between the two on both. A hybrid remains attractive because its component algorithms can make up for each other's shortcomings. As mentioned earlier, the item-based recommender had trouble recommending items to some new users. This is a common problem for collaborative filtering recommenders: when users have rated only a few of the items available in the database, there is too little information to locate good neighbors, which leads to weak recommendations.

To conclude, recommender systems open new opportunities for retrieving personalized information on the web. They also help alleviate information overload, a very common problem with information retrieval systems, and give users access to products and services they would not readily find on their own. This project discussed three recommendation techniques and highlighted their strengths and weaknesses. Several learning algorithms were used to generate the recommendation models, and evaluation metrics were used to measure the quality and performance of those algorithms.

Anime Recommendation


Fran Sanjaya Lumbangaol

Updated: January 31, 2020
