Anime Recommendation
MyAnimeList, also known as MAL, is the world’s largest anime and manga database and community, where users can organize anime into personal lists. Once a user has watched an anime on their list, they can give it a rating. These ratings make it possible to find users who have similar tastes. This project explores the contents of this dataset to gain insights. Later on, an item-item collaborative filtering recommender system is built to recommend anime and predict ratings for users. The recommender system is then analyzed and evaluated to see how well it performs when recommending items.
myanimelist.net API provides anime data and user ratings. The data was obtained from Kaggle Datasets and contains information on user preference data from 73,516 users on 12,294 anime. Each user is able to add anime to their completed list and give it a rating and this data set is a compilation of those ratings. The scores/ratings range from 1 - 10 with 10 being the best. If the rating is -1, it means that the user did not provide a rating for that item.
The initial data looked like the dataframe below.
| anime_id | name | genre | type | episodes | rating | members |
|---|---|---|---|---|---|---|
| 32281 | Kimi no Na wa. | Drama, Romance, School, Supernatural | Movie | 1 | 9.37 | 200630 |
| 5114 | Fullmetal Alchemist: Brotherhood | Action, Adventure, Drama, Fantasy, Magic, Military, Shounen | TV | 64 | 9.26 | 793665 |
| 28977 | Gintama° | Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen | TV | 51 | 9.25 | 114262 |
| 9253 | Steins;Gate | Sci-Fi, Thriller | TV | 24 | 9.17 | 673572 |
| 9969 | Gintama' | Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen | TV | 51 | 9.16 | 151266 |
| 32935 | Haikyuu!!: Karasuno Koukou VS Shiratorizawa Gakuen Koukou | Comedy, Drama, School, Shounen, Sports | TV | 10 | 9.15 | 93351 |
The following is an explanation of the features contained in the entire dataset.
Anime.csv
anime_id - myanimelist.net’s unique id identifying an anime.
name - full name of anime.
genre - comma separated list of genres for this anime.
type - movie, TV, OVA, etc.
episodes - how many episodes in this show. (1 if movie).
rating - average rating out of 10 for this anime.
members - number of community members that are in this anime’s “group”.

| user_id | anime_id | rating |
|---|---|---|
| 1 | 20 | -1 |
| 1 | 24 | -1 |
| 1 | 79 | -1 |
| 1 | 226 | -1 |
Rating.csv
user_id - non identifiable randomly generated user id.
anime_id - the anime that this user has rated.
rating - rating out of 10 this user has assigned (-1 if the user watched it but didn’t assign a rating).
According to the description found with the data, the ratings range from 1 to 10, and an item that a user did not rate is recorded with a rating of -1. For simplicity, I will change -1 to NA to indicate that the rating is missing. I will also change the data type of some variables.
After the dataset is collected, the next step in the process is preprocessing. At this stage we do data wrangling, which means transforming the data into a tidy form that is ready to be analyzed.
Missing data can be a non-trivial problem when analysing a dataset, and accounting for it is usually not straightforward either. For this anime dataframe, however, we can simply remove the rows with missing values:
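type : 25 observations
rating : 205 observations
library(dplyr)  # provides %>% and mutate()
library(tidyr)  # provides drop_na()
# Drop rows with an empty type, then remove the unused factor level
anime <- anime[anime$type != "",]
anime$type <- droplevels(anime$type)
# Drop rows with a missing rating
anime <- anime %>%
  drop_na(rating)
After inspecting the data, the following preprocessing steps are applied to the dataset: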
anime <- anime %>%
mutate(
anime_id = as.factor(anime_id),
name = as.character(name),
genre = as.character(genre),
episodes = as.numeric(as.character(episodes))
)
ratings <- ratings %>%
mutate(
user_id = as.factor(user_id),
anime_id = as.factor(anime_id)
)
Then we can convert -1 in the rating column of the ratings dataframe into NA. There are many ways to approach missing data, such as imputation. Imputation simply means replacing the missing values with an estimate, then analyzing the full data set as if the imputed values were actual observed values. In this case, however, I will leave these NA values as they are.
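The conversion itself is not shown, so here is a minimal sketch of one way to do it in base R. The `ratings_toy` data frame and its values are made up for illustration and are not taken from the dataset.

```r
# Toy stand-in for the ratings data (illustrative values only)
ratings_toy <- data.frame(
  user_id  = c(1, 1, 2, 2, 3),
  anime_id = c(20, 24, 20, 79, 24),
  rating   = c(-1, 8, 10, -1, 7)
)

# Replace the -1 placeholder with NA so it is treated as missing
ratings_toy$rating[ratings_toy$rating == -1] <- NA

# Summaries now have to skip the missing values explicitly
n_missing   <- sum(is.na(ratings_toy$rating))
mean_rating <- mean(ratings_toy$rating, na.rm = TRUE)
```

Leaving the values as NA, rather than imputing them, means that any later average has to be computed with `na.rm = TRUE` or an equivalent.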
Before modeling, we carry out Exploratory Data Analysis (EDA). At this point, we can analyze the distribution of our dataset by its features/variables. The following is the EDA generated from rating.csv data.
Rating Distribution
Ratings across all anime are approximately normally distributed, with an average rating of 6.473902.
Type Distribution
Next, we look at how the anime types are distributed. This can be seen from the number of anime in each type and the average rating of the anime in each type.
Genre Distribution
#> As we can see, they made 3230 different combinations. We can split them into a single genre list, that would be good for our further analysis
#> [1] "Types of Genre -> "
#> [1] "Comedy" "Action" "Fantasy" "Sci-Fi"
#> [5] "Drama" "Shounen" "Kids" "Adventure"
#> [9] "Romance" "SliceofLife" "School" "Hentai"
#> [13] "Supernatural" "Mecha" "Music" "Historical"
#> [17] "Magic" "Ecchi" "Shoujo" "Sports"
#> [21] "Seinen" "Mystery" "SuperPower" "Military"
#> [25] "Parody" "Space" "Horror" "Harem"
#> [29] "Demons" "MartialArts" "Dementia" "Psychological"
#> [33] "Police" "Game" "Samurai" "Vampire"
#> [37] "Thriller" "Cars" "ShounenAi" "ShoujoAi"
#> [41] "Josei" "Yuri" "Yaoi"
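The code that produced this genre list is not shown. The following is a rough sketch of how comma-separated genre strings could be split into single genres in base R. The `genre_toy` vector is a made-up sample, and this sketch keeps the spaces inside genre names, whereas the output above appears to have removed them.

```r
# Toy genre strings in the same comma-separated format as the genre column
genre_toy <- c(
  "Drama, Romance, School, Supernatural",
  "Sci-Fi, Thriller",
  "Action, Comedy, Sci-Fi"
)

# Split each string on a comma followed by optional spaces, then flatten
genre_list <- unlist(strsplit(genre_toy, ",\\s*"))

# Unique single genres and how often each one occurs
genre_unique <- unique(genre_list)
genre_counts <- sort(table(genre_list), decreasing = TRUE)
```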
Next, we can transform our data into a realRatingMatrix to build a recommendation engine. Before proceeding to that stage, we filter the data to keep the computation manageable. We cut the size of the matrix down so that it only contains users who rated at least 500 anime shows and shows that were rated at least 1000 times.
user_filter <- ratings %>%
group_by(user_id) %>%
summarise(n=n()) %>%
filter(n>=500)
#1853
anime_filter <- ratings %>%
group_by(anime_id) %>%
summarise(n=n()) %>%
filter(n>=1000)
#1721
ratings_filter <- ratings %>%
filter(user_id %in% user_filter$user_id,
anime_id %in% anime_filter$anime_id)
anime_matrix <- as(ratings_filter, "realRatingMatrix")
#> 1853 x 1721 rating matrix of class 'realRatingMatrix' with 971835 ratings.
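As a quick check on how dense the filtered matrix is, the counts printed above can be combined as follows. This is only arithmetic on the reported numbers; the variable names are illustrative.

```r
# Figures reported for the filtered rating matrix
n_users   <- 1853
n_anime   <- 1721
n_ratings <- 971835

# Share of user-anime cells that hold a rating
n_cells <- n_users * n_anime
density <- n_ratings / n_cells
round(100 * density, 1)   # percentage of filled cells
```

Roughly 30% of the cells are filled, so most user-anime pairs are still unrated even after filtering.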
From this anime matrix, we can also observe their rating distribution based on what the user has rated.
Judging by the ratings these users gave, the shows are well regarded: the majority of ratings are 8 and up.
To improve recommendation performance, normalization is commonly used as a basic component of predictor models.
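A minimal sketch of what centering on rows does, in base R. The matrix `r` and its values are made up. Each user’s mean rating is subtracted from that user’s ratings, so a generous rater and a strict rater become comparable.

```r
# Toy user x item rating matrix; NA marks an unrated item (made-up values)
r <- matrix(
  c(10,  8, NA,
     4, NA,  2,
     7,  9,  8),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("u1", "u2", "u3"), c("a1", "a2", "a3"))
)

# Mean of each user's observed ratings
user_means <- rowMeans(r, na.rm = TRUE)

# Subtract each user's mean from that user's row (centering on rows)
r_centered <- sweep(r, 1, user_means, FUN = "-")
```

After centering, every user’s observed ratings average to zero, and missing entries stay missing.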
There are basically 2 approaches to make a recommendation. Let’s say we want to recommend a set of additional products to a customer who purchased a product X:
Recommender Systems are systems that aim to predict users’ interests and recommend items that are likely to interest them. They help users make decisions by discovering new and relevant items. As mentioned earlier, we will look at the way three types of recommenders work.
First we divide the data into training and test sets, so that the recommender algorithms can learn from the training data and then try to predict relevant outcomes for the test users.
#> Evaluation scheme with 4 items given
#> Method: 'split' with 4 run(s).
#> Training set proportion: 0.800
#> Good ratings: >=5.000000
#> Data set: 1853 x 1721 rating matrix of class 'realRatingMatrix' with 971835 ratings.
#> Normalized using center on rows.
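The call that created this evaluation scheme is not shown. The sketch below shows how a scheme with the same settings could be set up with recommenderlab. It uses a 200-user slice of the bundled `Jester5k` data as a stand-in for the anime matrix, and the name `toy_eval` and the seed are assumptions.

```r
library(recommenderlab)

# Jester5k ships with recommenderlab; it stands in for anime_matrix here
data(Jester5k)
toy_matrix <- Jester5k[1:200, ]

set.seed(123)  # make the random split reproducible
toy_eval <- evaluationScheme(
  data       = toy_matrix,
  method     = "split",
  train      = 0.8,  # training set proportion
  given      = 4,    # items revealed for each test user
  goodRating = 5,    # ratings >= 5 count as "good"
  k          = 4     # number of runs
)

train_set   <- getData(toy_eval, "train")    # users used to fit a model
known_set   <- getData(toy_eval, "known")    # revealed ratings of test users
unknown_set <- getData(toy_eval, "unknown")  # held-out ratings of test users
```

The "known" part is what a model sees when predicting for a test user, and the "unknown" part is what the predictions are scored against.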
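We are going to create a model called IBCF, or I(tem) B(ased) C(ollaborative) F(iltering). Item Based Collaborative Filtering computes similarities between items from their consumption history, that is, from how users rated them, and then recommends items similar to those a user already rated highly.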
#> [1] 1482 1721
#> [1] 371 1721
#> Recommender of type 'IBCF' for 'realRatingMatrix'
#> learned using 1482 users.
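To make the idea of item similarity concrete, here is a small base R sketch that uses cosine similarity between item columns. It is a simplified illustration, not recommenderlab’s implementation: the toy matrix is made up, and 0 stands in for "not rated".

```r
# Toy user x item matrix of centered ratings; 0 stands for "not rated"
# (a simplification made for this sketch)
r <- matrix(
  c( 2,  1,  0,
     1,  2, -1,
    -2, -1,  1,
     0, -2,  2),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("u", 1:4), c("a1", "a2", "a3"))
)

# Cosine similarity between every pair of item columns
norms    <- sqrt(colSums(r^2))
item_sim <- crossprod(r) / (norms %o% norms)

# Predict a new user's score for a2 as the similarity-weighted
# average of the ratings that user has already given
new_user <- c(a1 = 2, a2 = NA, a3 = -1)
rated    <- !is.na(new_user)
sims     <- item_sim["a2", rated]
score_a2 <- sum(sims * new_user[rated]) / sum(abs(sims))
```

Here a2 is similar to a1, which the user liked, and dissimilar to a3, which the user disliked, so the predicted score for a2 is positive.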
PREDICT
anime_pred <- predict(object = anime_item_recc, newdata = getData(anime_eval, "known"), n = 10)
anime_predr <- predict(object = anime_item_recc, newdata = getData(anime_eval, "known"), type = "ratings")
Let’s see the recommendations for the first few users.
#> $`226`
#> [1] 1041 1125 1254 1264 1323 1325 1329 1336 1410 1482
#>
#> $`1019`
#> [1] 134 159 335 621 724 745 856 868 915 918
#>
#> $`1504`
#> [1] 7 19 24 32 42 67 91 111 112 113
#>
#> $`1522`
#> [1] 578 794 835 889 925 948 958 965 976 995
#>
#> $`1530`
#> [1] 55 70 71 79 81 105 134 159 482 560
#>
#> $`1984`
#> [1] 122 261 262 416 417 418 419 420 463 476
#>
#> $`2273`
#> [1] 935 1175 1176 1402 1562 1579 1675 1686 7 23
#>
#> $`2297`
#> [1] 725 760 769 843 846 863 957 976 1180 1301
Notice that for some users, no items were recommended. Here we have the cold start problem: the recommender does not have adequate information about a user or an item to make relevant predictions. This happens often with collaborative filtering recommender systems and reduces their performance. The profile of a new user or item is empty because no ratings exist for it yet, so its taste is not known to the system.
Let’s see what was actually recommended for some of the users.
#> [[1]]
#> name type
#> 1 Gokinjo Monogatari TV
#> 2 Virtua Fighter TV
#> 3 Crayon Shin-chan TV
#> 4 Oruchuban Ebichu TV
#> 5 Sentou Yousei Yukikaze OVA
#> 6 Seikai no Danshou: Tanjou Special
#> 7 Hikari to Mizu no Daphne TV
#>
#> [[2]]
#> name type
#> 1 Witch Hunter Robin TV
#> 2 Monster TV
#> 3 School Rumble TV
#> 4 Neon Genesis Evangelion: The End of Evangelion Movie
#> 5 Basilisk: Kouga Ninpou Chou TV
#> 6 Mobile Suit Gundam Wing: Endless Waltz OVA
#> 7 Corrector Yui TV
#> 8 Chou Henshin Cosprayers TV
#> 9 Uchuu no Stellvia TV
#>
#> [[3]]
#> name type
#> 1 Gunslinger Girl TV
#> 2 Boukyaku no Senritsu TV
#> 3 Matantei Loki Ragnarok TV
#> 4 Night Walker: Mayonaka no Tantei TV
#> 5 Enzai OVA
#> 6 Utawarerumono TV
#> 7 Slayers Great Movie
#> 8 Ginga Densetsu Weed TV
#> 9 Gintama TV
#>
#> [[4]]
#> [1] name type
#> <0 rows> (or 0-length row.names)
#> Recommender of type 'SVD' for 'realRatingMatrix'
#> learned using 1482 users.
PREDICT
anime_svd_pred <- predict(object = anime_SVD_recc, newdata = getData(anime_eval, "known"), n = 10)
anime_svd_predr <- predict(object = anime_SVD_recc, newdata = getData(anime_eval, "known"), type = "ratings")
Let’s see what SVD recommends for the first few users.
#> $`226`
#> [1] 1286 1405 1220 1315 1160 1330 1306 1360 1296 1254
#>
#> $`1019`
#> [1] 1159 1432 1636 1453 1502 1506 1677 1508 1656 1646
#>
#> $`1504`
#> [1] 1064 1552 1665 961 1685 1114 1650 1700 937 914
#>
#> $`1522`
#> [1] 1162 536 1528 1263 1183 487 343 1031 1225 1348
#>
#> $`1530`
#> [1] 591 482 1159 853 762 1031 512 1103 487 458
#>
#> $`1984`
#> [1] 208 591 797 482 551 184 632 1416 341 322
#>
#> $`2273`
#> [1] 1709 1634 1039 1718 378 1288 1556 628 1624 1291
#>
#> $`2297`
#> [1] 591 1263 1269 482 760 1156 1354 1229 1231 997
Unlike the item-based recommender, the SVD algorithm provided a recommendation for every user. In general, SVD is a commonly used method for estimating missing data in a data matrix. Since recommender systems are essentially trying to estimate the missing ratings for users, the use of SVD makes sense. Compared with IBCF, some of the recommended items are the same.
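The idea of estimating missing ratings with SVD can be sketched in base R as follows. This is a simplified illustration that assumes the gaps are first filled with column means; the toy matrix and the choice of `k = 2` are made up.

```r
# Toy rating matrix with two missing entries (made-up values)
r <- matrix(
  c(9,  8,  2, 1,
    8, NA,  1, 2,
    2,  1,  9, 8,
    1,  2, NA, 9),
  nrow = 4, byrow = TRUE
)

# Fill the gaps with column means so that svd() can run
filled <- r
for (j in seq_len(ncol(r))) {
  filled[is.na(r[, j]), j] <- mean(r[, j], na.rm = TRUE)
}

# Keep only the k largest singular values (rank-k approximation)
k <- 2
s <- svd(filled)
approx_r <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])

# The reconstructed values at the missing positions act as predictions
predicted <- approx_r[is.na(r)]
```

Because the low-rank reconstruction produces a value for every cell, this approach can score every item for every user, which fits the observation that SVD returned recommendations for all users.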
Now let’s have a look at what the numbers match to.
#> [[1]]
#> name type
#> 1 Kachou Ouji TV
#> 2 Hyper Police TV
#> 3 Variable Geo OVA
#> 4 Comic Party Revolution TV
#> 5 Black Lagoon TV
#> 6 Ginga Eiyuu Densetsu OVA
#> 7 Suika OVA
#> 8 Yoshinaga-san'chi no Gargoyle TV
#> 9 HeatGuy J TV
#>
#> [[2]]
#> name type
#> 1 Mazinkaiser OVA
#> 2 Hiatari Ryoukou! Yume no Naka ni Kimi ga Ita Movie
#> 3 Babel Nisei (1992) OVA
#> 4 Virtua Fighter TV
#> 5 Duel Masters TV
#> 6 Uchuu Senkan Yamato TV
#> 7 Appleseed OVA
#> 8 Ike! Ina-chuu Takkyuubu TV
#>
#> [[3]]
#> name type
#> 1 Geobreeders 2: Mouryou Yuugekitai File-XX Ransen Toppa OVA
#> 2 Lupin III: Fuuma Ichizoku no Inbou Movie
#> 3 Ai Shimai Tsubomi... Kegashite Kudasai OVA
#> 4 Maison Ikkoku TV
#> 5 Mahou Shoujo Pretty Sammy (1996) TV
#> 6 Detective Conan Movie 10: Requiem of the Detectives Movie
#> 7 Cosmo Warrior Zero Gaiden Special
#> 8 Sci-fi Harry TV
#> 9 PostPet Momobin TV
#>
#> [[4]]
#> name type
#> 1 Tenjou Tenge: The Past Chapter Movie
#> 2 Tenchi Muyou! Ryououki 3rd Season: Tenchi Seirou naredo Namitakashi? OVA
#> 3 Bishoujo Senshi Sailor Moon S: Kaguya Hime no Koibito Movie
#> 4 Kachou Ouji TV
#> 5 Bakuretsu Hunters TV
#> 6 Hachimitsu to Clover TV
#> 7 Saishuu Heiki Kanojo TV
#> 8 RahXephon Interlude: Her and Herself/Thatness and Thereness OVA
#> 9 Spiral: Suiri no Kizuna TV
The hybrid recommender combines Item-Item CF with what the user previously liked (RERECOMMEND), popular options (POPULAR) and random picks for diversity (RANDOM).
anime_hybrid_recc <- HybridRecommender(
Recommender(data = getData(anime_eval, "train"), method = "IBCF"),
Recommender(data = getData(anime_eval, "train"), method = "POPULAR"),
Recommender(data = getData(anime_eval, "train"), method = "RERECOMMEND"),
Recommender(data = getData(anime_eval, "train"), method = "RANDOM"), #diversity
weights = c(0.5, 0.3, 0.1, 0.1)
)
#> Recommender of type 'HYBRID' for 'ratingMatrix'
#> learned using NA users.
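To illustrate what the weights do, the sketch below blends per-model scores with the same weights in base R. The scores are hypothetical, and the handling of missing scores (renormalizing over the models that produced one) is an assumption made for this sketch, not necessarily how `HybridRecommender` aggregates internally.

```r
# Hypothetical scores for four items from each component model
scores <- rbind(
  IBCF        = c(8.0,  NA, 6.5, 7.0),
  POPULAR     = c(7.5, 9.0, 6.0, 8.0),
  RERECOMMEND = c( NA,  NA, 9.0,  NA),
  RANDOM      = c(5.0, 6.0, 4.0, 7.0)
)
colnames(scores) <- paste0("item", 1:4)
weights <- c(0.5, 0.3, 0.1, 0.1)

# Weight of each model for each item, zero where the model gave no score
w_mat <- weights * !is.na(scores)

# Weighted average over the models that produced a score (assumption)
hybrid <- colSums(scores * weights, na.rm = TRUE) / colSums(w_mat)

# Items ordered from best to worst blended score
ranking <- names(sort(hybrid, decreasing = TRUE))
```

With a weight of 0.5, IBCF dominates wherever it has a score, while POPULAR and RANDOM fill in for items that IBCF cannot score.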
PREDICT
anime_hybrid_pred <- predict(object = anime_hybrid_recc, newdata = getData(anime_eval, "known"), n = 10)
anime_hybrid_predr <- predict(object = anime_hybrid_recc, newdata = getData(anime_eval, "known"), type = "ratings")
These are what HYBRID recommends for the first few users.
#> $`226`
#> [1] 1041 1254 1410 762 1336 1482 1125 1039 1325 591
#>
#> $`1019`
#> [1] 1550 1379 1508 868 1672 1348 1564 1707 1240 1187
#>
#> $`1504`
#> [1] 235 529 1493 1656 1195 1432 1704 1327 548 178
#>
#> $`1522`
#> [1] 378 1126 901 1195 559 1263 1291 948 997 1562
#>
#> $`1530`
#> [1] 824 591 105 1620 482 70 134 1327 1159 81
#>
#> $`1984`
#> [1] 1222 262 419 416 476 261 463 420 1204 999
#>
#> $`2273`
#> [1] 1710 1678 1665 710 1656 1275 343 1373 1707 1627
#>
#> $`2297`
#> [1] 760 769 843 1180 1301 725 1354 1605 957 1676
Some of the items recommended by IBCF and SVD appear again in the hybrid recommender.
Let’s see the actual items recommended.
#> [[1]]
#> name type
#> 1 Virtua Fighter TV
#> 2 Seikai no Danshou: Tanjou Special
#> 3 Gokinjo Monogatari TV
#> 4 Sentou Yousei Yukikaze OVA
#> 5 Crayon Shin-chan TV
#> 6 Oruchuban Ebichu TV
#> 7 One: Kagayaku Kisetsu e - True Stories OVA
#>
#> [[2]]
#> name type
#> 1 Detective Conan TV
#> 2 Saishuu Heiki Kanojo TV
#> 3 Project ARMS: The 2nd Chapter TV
#> 4 PostPet Momobin TV
#> 5 Zero no Tsukaima TV
#> 6 Lupin III: Fuuma Ichizoku no Inbou Movie
#> 7 Eat-Man TV
#> 8 Aoki Densetsu Shoot! TV
#> 9 Wonderful Days Movie
#> 10 Ultra Maniac TV
#>
#> [[3]]
#> name
#> 1 Attack No.1
#> 2 Kino no Tabi: Nanika wo Suru Tame ni - Life Goes On.
#> 3 Sci-fi Harry
#> 4 Slayers Great
#> 5 Bomberman B-Daman Bakugaiden
#> 6 Bubblegum Crash
#> 7 Pokemon Advanced Generation
#> 8 Kinnikuman II Sei
#> 9 Bishoujo Senshi Sailor Moon SuperS: Sailor 9 Senshi Shuuketsu! Black Dream Hole no Kiseki
#> 10 DNA² OVA
#> type
#> 1 TV
#> 2 Movie
#> 3 TV
#> 4 Movie
#> 5 TV
#> 6 OVA
#> 7 TV
#> 8 TV
#> 9 Movie
#> 10 OVA
#>
#> [[4]]
#> name type
#> 1 Beet the Vandel Buster TV
#> 2 Guardian Hearts OVA
#> 3 I: Wish You Were Here TV
#> 4 Figure 17: Tsubasa & Hikaru TV
#> 5 Ai Shimai: Futari no Kajitsu OVA
#> 6 Bishoujo Senshi Sailor Moon S: Kaguya Hime no Koibito Movie
#> 7 Bleach: Memories in the Rain Special
#> 8 One Piece TV
#> 9 Shaman King TV
#> 10 Winter Garden Special
IBCF
anime_item_acc1 <- calcPredictionAccuracy(x = anime_pred, data = getData(anime_eval, "unknown"), given = 4, goodRating = 5)
anime_item_acc2 <- calcPredictionAccuracy(x = anime_predr, data = getData(anime_eval, "unknown"))
SVD
anime_svd_acc1 <- calcPredictionAccuracy(x = anime_svd_pred, data = getData(anime_eval, "unknown"), given = 4, goodRating = 5)
anime_svd_acc2 <- calcPredictionAccuracy(x = anime_svd_predr, data = getData(anime_eval, "unknown"))
HYBRID
anime_hy_acc1 <- calcPredictionAccuracy(x = anime_hybrid_pred, data = getData(anime_eval, "unknown"), given = 4, goodRating = 5)
anime_hy_acc2 <- calcPredictionAccuracy(x = anime_hybrid_predr, data = getData(anime_eval, "unknown"))

| | TP | FP | FN | TN | precision | recall | TPR | FPR |
|---|---|---|---|---|---|---|---|---|
| anime_item_acc1 | 0.2345013 | 8.210243 | 98.33154 | 1610.224 | 0.0293375 | 0.0019839 | 0.0019839 | 0.0048906 |
| anime_svd_acc1 | 1.1967655 | 8.803235 | 97.36927 | 1609.631 | 0.1196765 | 0.0101774 | 0.0101774 | 0.0053067 |
| anime_hy_acc1 | 0.7358491 | 9.264151 | 97.83019 | 1609.170 | 0.0735849 | 0.0055581 | 0.0055581 | 0.0056751 |

| | RMSE | MSE | MAE |
|---|---|---|---|
| anime_item_acc2 | 1.467894 | 2.154712 | 1.043637 |
| anime_svd_acc2 | 1.622312 | 2.631897 | 1.192865 |
| anime_hy_acc2 | 1.525050 | 2.325776 | 1.093793 |
To sum up this table: the lower the numbers, the better the performance of the model, so IBCF performs best on these error measures.
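For reference, the three error measures can be computed by hand as in the base R sketch below. The `actual` and `predicted` vectors are hypothetical and are not taken from the evaluation above.

```r
# Hypothetical held-out ratings and model predictions
actual    <- c(8, 6, 9, 7, 10)
predicted <- c(7, 7, 9, 5, 9)

err  <- predicted - actual
mse  <- mean(err^2)      # mean squared error
rmse <- sqrt(mse)        # root mean squared error
mae  <- mean(abs(err))   # mean absolute error
```

MSE is simply the square of RMSE. In the table, for example, the IBCF RMSE of 1.467894 squared is approximately 2.1547, which matches the reported MSE.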
#> IBCF run fold/sample [model time/prediction time]
#> 1 [52.68sec/0.17sec]
#> 2 [52.6sec/0.17sec]
#> 3 [52.71sec/0.15sec]
#> 4 [52.29sec/0.18sec]
#> SVD run fold/sample [model time/prediction time]
#> 1 [1.03sec/0.57sec]
#> 2 [1.2sec/0.82sec]
#> 3 [1.09sec/0.81sec]
#> 4 [1.31sec/0.54sec]
#> POPULAR run fold/sample [model time/prediction time]
#> 1 [0.01sec/2.03sec]
#> 2 [0.02sec/2.03sec]
#> 3 [0.02sec/1.94sec]
#> 4 [0.01sec/1.91sec]
#> RANDOM run fold/sample [model time/prediction time]
#> 1 [0.02sec/0.59sec]
#> 2 [0.01sec/0.57sec]
#> 3 [0.02sec/0.56sec]
#> 4 [0.02sec/0.59sec]
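The call that produced these timings is not shown. The sketch below shows how such a comparison could be run with recommenderlab’s `evaluate()`. It uses a 200-user slice of the bundled `Jester5k` data in place of the anime matrix, and the names, the seed, the number of runs and the list lengths `n` are all assumptions.

```r
library(recommenderlab)

data(Jester5k)  # bundled data, stands in for anime_matrix
set.seed(123)
toy_eval <- evaluationScheme(Jester5k[1:200, ], method = "split",
                             train = 0.8, given = 4, goodRating = 5, k = 2)

# Algorithms to compare; NULL means default parameters
algorithms <- list(
  IBCF    = list(name = "IBCF",    param = NULL),
  SVD     = list(name = "SVD",     param = NULL),
  POPULAR = list(name = "POPULAR", param = NULL),
  RANDOM  = list(name = "RANDOM",  param = NULL)
)

# Evaluate top-N lists of several lengths for every algorithm
results <- evaluate(toy_eval, algorithms, type = "topNList",
                    n = c(1, 5, 10, 20))

svd_avg <- avg(results[["SVD"]])          # averaged confusion matrix per n
plot(results, annotate = 1)               # ROC curve (TPR against FPR)
plot(results, "prec/rec", annotate = 1)   # precision-recall curve
```

Each algorithm is fitted once per run, which is where the per-run model and prediction times come from.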
ROC Curve
The closer the curve is to the top left, the better the performance.
Precision-Recall Curve
The closer the curve is to the top right, the better the performance. In this case, the Singular Value Decomposition algorithm performed best.
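The quantities plotted on these curves come from confusion counts. Here is a small base R sketch with hypothetical counts for a single user’s top-10 list.

```r
# Hypothetical confusion counts for one user's top-10 list
TP <- 2     # recommended and actually rated "good"
FP <- 8     # recommended but not "good"
FN <- 18    # "good" items that were not recommended
TN <- 1672  # remaining items, correctly not recommended

precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)  # identical to the true positive rate (TPR)
fpr       <- FP / (FP + TN)  # false positive rate
```

The ROC curve plots TPR against FPR, and the precision-recall curve plots precision against recall. Recall and TPR are the same quantity, which is why those two columns are identical in the accuracy table.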
Overall, no single recommender won on every measure: IBCF had the lowest prediction error (RMSE, MSE and MAE), SVD had the best precision and recall for its top-10 recommendations, and the Hybrid Recommender fell between the two on both. A middle position is plausible for a hybrid recommender, because its algorithms make up for each other’s shortcomings. As mentioned earlier, the item-based recommender had trouble recommending items for some users. This is a problem for collaborative filtering recommenders: when users have rated only a few of the items available in the database, there is not enough information to locate good neighbors, which ultimately produces weak recommendations.
To conclude, recommender systems open new opportunities for retrieving personalized information on the web. They also help to alleviate the problem of information overload, which is very common with information retrieval systems, and give users access to products and services that are not readily available to them on the system. This project discussed three recommendation techniques and highlighted their strengths and weaknesses. Several learning algorithms were used to generate the recommendation models, and evaluation metrics were used to measure the quality and performance of the algorithms discussed.