Movie Recommender System with Large Dataset

Objectives

The goal for your final project is for you to build out a recommender system using a large dataset (ex: 1M+ ratings or 10k+ users, 10k+ items). There are three deliverables, with separate dates:

[1] Planning Document. Find an interesting dataset and describe the system you plan to build out. If you would like to use one of the datasets you have already worked with, you should add a unique element or incorporate additional data (e.g., explicit features you scrape from another source, like image analysis on movie posters). The overall goal, however, will be to produce quality recommendations by extracting insights from a large dataset. You may do so using Spark, or another distributed computing method, OR by effectively applying one of the more advanced mathematical techniques we have covered. There is no preference for one over the other, as long as your recommender works! The planning document should be written up and published as a notebook on GitHub or in RPubs. Please submit the link in the Unit 4 folder, due Thursday, July 5.

[2] Presentation. Make a five-minute presentation of your system in our final meetup on Tuesday. If you’re not able to attend the meetup, you’re responsible for either recording your presentation, or scheduling one-on-one time to deliver your presentation prior to the meetup. You should be prepared to present on Tuesday. You should use this project to showcase some of the concepts that you have learned in this course, while delivering on the (probably) less familiar Spark platform. You are welcome to submit a compelling alternative proposal (subject to approval), such as implementing a recommender system using Microsoft Azure ML Studio or Google TensorFlow, or building out an application of a certain complexity using another tool. You may work in a small group (2-3) on this assignment.

[3] Implementation. In this final project deliverable, you’ll build out the system that you describe in your planning document. This will be due on Thursday and must be turned in as an RMarkdown file or a Jupyter notebook, and posted to GitHub or RPubs.com.

Libraries

library(recommenderlab)
library(reshape2)
library(RCurl)
library(ggplot2)
library(knitr)
library(kableExtra)
library(dplyr)
library(tidyr)
library(tictoc)

Preamble

In our proposal, we said that we would use the full MovieLens dataset from the “recommended for education and development” section of https://grouplens.org/datasets/movielens/. However, the dataset was so large that R ran out of memory while creating the rating matrix. The error message is shown below.

Error: cannot allocate vector of size 113.7 Gb

So we changed our plan and obtained similar but smaller data from Kaggle that still fulfills the minimum size requirements of the project. The Kaggle data is described below.

Data

MyAnimeList, often abbreviated as MAL, is an anime and manga social networking and social cataloging website. The site provides its users with a list-like system to organize and score anime and manga. It facilitates finding users who share similar tastes and provides a large database on anime and manga. In 2018, MyAnimeList reported having approximately 15,000 anime and 45,000 manga entries. In 2015, the site received 120 million visitors a month.

We gathered the data from Kaggle, which provides two CSV files. The data is described as follows:

This data set contains information on user preference data from 73,516 users on 12,294 anime. Each user is able to add anime to their completed list and give it a rating and this data set is a compilation of those ratings.

Anime.csv

anime_id - myanimelist.net’s unique id identifying an anime.
name - full name of anime.
genre - comma separated list of genres for this anime.
type - movie, TV, OVA, etc.
episodes - how many episodes in this show. (1 if movie).
rating - average rating out of 10 for this anime.
members - number of community members that are in this anime’s “group”.

Rating.csv

user_id - non identifiable randomly generated user id.
anime_id - the anime that this user has rated.
rating - rating out of 10 this user has assigned (-1 if the user watched it but didn’t assign a rating).

Load Data

anime <- read.csv("./anime.csv")
anime_ratings <- read.csv("./rating.csv")

Preview data

#Preview anime data
kable(head(anime, n = 10L)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  row_spec(0, bold = T, color = "white", background = "#fc5e5e") %>%
    scroll_box(width = "100%", height = "200px")

anime_id	name	genre	type	episodes	rating	members
32281	Kimi no Na wa.	Drama, Romance, School, Supernatural	Movie	1	9.37	200630
5114	Fullmetal Alchemist: Brotherhood	Action, Adventure, Drama, Fantasy, Magic, Military, Shounen	TV	64	9.26	793665
28977	Gintama°	Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen	TV	51	9.25	114262
9253	Steins;Gate	Sci-Fi, Thriller	TV	24	9.17	673572
9969	Gintama'	Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen	TV	51	9.16	151266
32935	Haikyuu!!: Karasuno Koukou VS Shiratorizawa Gakuen Koukou	Comedy, Drama, School, Shounen, Sports	TV	10	9.15	93351
11061	Hunter x Hunter (2011)	Action, Adventure, Shounen, Super Power	TV	148	9.13	425855
820	Ginga Eiyuu Densetsu	Drama, Military, Sci-Fi, Space	OVA	110	9.11	80679
15335	Gintama Movie: Kanketsu-hen - Yorozuya yo Eien Nare	Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen	Movie	1	9.10	72534
15417	Gintama': Enchousen	Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen	TV	13	9.11	81109

glimpse(anime)   # glimpse

## Rows: 12,294
## Columns: 7
## $ anime_id <int> 32281, 5114, 28977, 9253, 9969, 32935, 11061, 820, 15335, ...
## $ name     <fct> Kimi no Na wa., Fullmetal Alchemist: Brotherhood, Gintama...
## $ genre    <fct> "Drama, Romance, School, Supernatural", "Action, Adventure...
## $ type     <fct> Movie, TV, TV, TV, TV, TV, TV, OVA, Movie, TV, TV, Movie, ...
## $ episodes <fct> 1, 64, 51, 24, 51, 10, 148, 110, 1, 13, 24, 1, 201, 25, 25...
## $ rating   <dbl> 9.37, 9.26, 9.25, 9.17, 9.16, 9.15, 9.13, 9.11, 9.10, 9.11...
## $ members  <int> 200630, 793665, 114262, 673572, 151266, 93351, 425855, 806...

#Preview ratings data
kable(head(anime_ratings, n = 10L)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  row_spec(0, bold = T, color = "white", background = "#fc5e5e") %>%
    scroll_box(width = "100%", height = "200px")

user_id	anime_id	rating
1	20	-1
1	24	-1
1	79	-1
1	226	-1
1	241	-1
1	355	-1
1	356	-1
1	442	-1
1	487	-1
1	846	-1

glimpse(anime_ratings)  # glimpse ratings

## Rows: 7,813,737
## Columns: 3
## $ user_id  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ anime_id <int> 20, 24, 79, 226, 241, 355, 356, 442, 487, 846, 936, 1546, ...
## $ rating   <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1...

Clean the Data

As mentioned above, if a user watched an anime but didn’t assign a rating, the corresponding rating field contains -1. So we converted these unrated entries to NA and changed the data types of the id and text columns for downstream analysis.

anime_ratings$rating[anime_ratings$rating == -1] <- NA     # -1 if the user watched it but didn't assign a rating
anime_sp <- anime_ratings
anime_ratings$user_id <- as.factor(anime_ratings$user_id)
anime_ratings$anime_id <- as.factor(anime_ratings$anime_id)
anime$anime_id <- as.factor(anime$anime_id)
anime$name <- as.character(anime$name)
anime$type <- as.character(anime$type)
anime$genre <- as.character(anime$genre)

glimpse(anime_ratings)

## Rows: 7,813,737
## Columns: 3
## $ user_id  <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ anime_id <fct> 20, 24, 79, 226, 241, 355, 356, 442, 487, 846, 936, 1546, ...
## $ rating   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...

glimpse(anime)

## Rows: 12,294
## Columns: 7
## $ anime_id <fct> 32281, 5114, 28977, 9253, 9969, 32935, 11061, 820, 15335, ...
## $ name     <chr> "Kimi no Na wa.", "Fullmetal Alchemist: Brotherhood", "Gin...
## $ genre    <chr> "Drama, Romance, School, Supernatural", "Action, Adventure...
## $ type     <chr> "Movie", "TV", "TV", "TV", "TV", "TV", "TV", "OVA", "Movie...
## $ episodes <fct> 1, 64, 51, 24, 51, 10, 148, 110, 1, 13, 24, 1, 201, 25, 25...
## $ rating   <dbl> 9.37, 9.26, 9.25, 9.17, 9.16, 9.15, 9.13, 9.11, 9.10, 9.11...
## $ members  <int> 200630, 793665, 114262, 673572, 151266, 93351, 425855, 806...
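Recoding -1 as NA keeps the unrated rows in the data frame, and the unique() and summary() output further down shows NA among the values stored in the rating matrix. If unrated entries should be excluded instead, one option is to drop those rows before building the matrix. The following is only a sketch on made-up data; `toy_ratings` and `rated_only` are illustrative names, not objects from the analysis.

```r
# Toy stand-in for rating.csv (values are made up for illustration)
toy_ratings <- data.frame(
  user_id  = c(1, 1, 2, 2, 3),
  anime_id = c(20, 24, 20, 79, 24),
  rating   = c(-1, 8, 10, -1, 7)
)

# Recode -1 (watched but not rated) as NA, as in the cleaning step above
toy_ratings$rating[toy_ratings$rating == -1] <- NA

# Keep only the rows that carry an explicit rating
rated_only <- toy_ratings[!is.na(toy_ratings$rating), ]

nrow(rated_only)         # number of explicitly rated rows
mean(rated_only$rating)  # mean over explicit ratings only
```

Whether to drop these rows is a modeling choice: keeping them preserves the "watched" signal, while dropping them keeps only explicit scores.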

Data Exploration

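Highest rated anime

Note that top_n(10) is called here without a wt argument, so dplyr ranks by the last column (members) rather than by rating, as the "Selecting by members" message in the output confirms. The table therefore lists the ten anime with the most members, sorted by rating; calling top_n(10, rating) would select by rating instead.

anime %>% arrange(desc(rating)) %>% 
  top_n(10) %>% kable() %>%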
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  row_spec(0, bold = T, color = "white", background = "#fc5e5e") %>%
    scroll_box(width = "100%", height = "200px")

## Selecting by members

anime_id	name	genre	type	episodes	rating	members
5114	Fullmetal Alchemist: Brotherhood	Action, Adventure, Drama, Fantasy, Magic, Military, Shounen	TV	64	9.26	793665
9253	Steins;Gate	Sci-Fi, Thriller	TV	24	9.17	673572
1575	Code Geass: Hangyaku no Lelouch	Action, Mecha, Military, School, Sci-Fi, Super Power	TV	25	8.83	715151
1535	Death Note	Mystery, Police, Psychological, Supernatural, Thriller	TV	37	8.71	1013917
16498	Shingeki no Kyojin	Action, Drama, Fantasy, Shounen, Super Power	TV	25	8.54	896229
4224	Toradora!	Comedy, Romance, School, Slice of Life	TV	25	8.45	633817
6547	Angel Beats!	Action, Comedy, Drama, School, Supernatural	TV	13	8.39	717796
10620	Mirai Nikki (TV)	Action, Mystery, Psychological, Shounen, Supernatural, Thriller	TV	26	8.07	657190
11757	Sword Art Online	Action, Adventure, Fantasy, Game, Romance	TV	25	7.83	893100
20	Naruto	Action, Comedy, Martial Arts, Shounen, Super Power	TV	220	7.81	683297

Number of anime titles by type

anime %>% count(type)%>% 
  ggplot(aes(x = type, y = n)) + 
  geom_bar(stat = "identity", fill = "light blue" ) + 
  geom_text(aes(label=n), vjust= -0.6, color="black", size=3.5) +
  theme_minimal()

Anime with the most members

anime %>% arrange(desc(members)) %>% 
  top_n(10) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  row_spec(0, bold = T, color = "white", background = "#fc5e5e") %>%
    scroll_box(width = "100%", height = "200px")

## Selecting by members

anime_id	name	genre	type	episodes	rating	members
1535	Death Note	Mystery, Police, Psychological, Supernatural, Thriller	TV	37	8.71	1013917
16498	Shingeki no Kyojin	Action, Drama, Fantasy, Shounen, Super Power	TV	25	8.54	896229
11757	Sword Art Online	Action, Adventure, Fantasy, Game, Romance	TV	25	7.83	893100
5114	Fullmetal Alchemist: Brotherhood	Action, Adventure, Drama, Fantasy, Magic, Military, Shounen	TV	64	9.26	793665
6547	Angel Beats!	Action, Comedy, Drama, School, Supernatural	TV	13	8.39	717796
1575	Code Geass: Hangyaku no Lelouch	Action, Mecha, Military, School, Sci-Fi, Super Power	TV	25	8.83	715151
20	Naruto	Action, Comedy, Martial Arts, Shounen, Super Power	TV	220	7.81	683297
9253	Steins;Gate	Sci-Fi, Thriller	TV	24	9.17	673572
10620	Mirai Nikki (TV)	Action, Mystery, Psychological, Shounen, Supernatural, Thriller	TV	26	8.07	657190
4224	Toradora!	Comedy, Romance, School, Slice of Life	TV	25	8.45	633817

Create Matrix

AnimeMatrix <- as(anime_ratings, "realRatingMatrix")
AnimeMatrix

## 73515 x 11200 rating matrix of class 'realRatingMatrix' with 7813730 ratings.

dim(AnimeMatrix)

## [1] 73515 11200

object.size(AnimeMatrix)

## 99233736 bytes
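The small object size above comes from the sparse storage of the rating matrix. As a rough back-of-the-envelope check, assuming a dense matrix of 8-byte doubles, the same dimensions would need several gigabytes, which is the same kind of blow-up behind the memory error in the preamble. The variable names below are illustrative.

```r
n_users <- 73515       # rows reported by dim(AnimeMatrix)
n_items <- 11200       # columns reported by dim(AnimeMatrix)
bytes_per_double <- 8  # R stores numeric values as 8-byte doubles

# Size of a fully dense users x items matrix
dense_bytes  <- n_users * n_items * bytes_per_double
# Size reported by object.size(AnimeMatrix) above
sparse_bytes <- 99233736

dense_gib <- dense_bytes / 1024^3
ratio     <- dense_bytes / sparse_bytes

cat(sprintf("Dense estimate: %.1f GiB\n", dense_gib))
cat(sprintf("Dense / sparse size ratio: about %.0fx\n", ratio))
```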

Selecting the most relevant data

On exploring the data, we noticed that the table contains:

Anime that have been watched only a few times, whose ratings might therefore be unreliable. So we keep only anime that have been watched more than 1,000 times.
Users who have watched only a few anime, whose ratings might be unreliable too. So we keep only users with more than 500 anime in their lists.

AnimeMatrix <- AnimeMatrix[rowCounts(AnimeMatrix) > 500, colCounts(AnimeMatrix) > 1000]
AnimeMatrix

## 1843 x 1720 rating matrix of class 'realRatingMatrix' with 967727 ratings.

object.size(AnimeMatrix)

## 11850056 bytes
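The filtering step can be illustrated on a small dense matrix using base R, where rowSums() and colSums() over the non-missing cells play the role of rowCounts() and colCounts(). This is a sketch with made-up values and hypothetical thresholds (`min_user`, `min_item`), far smaller than the 500 and 1000 used above.

```r
# Toy user x item matrix; NA marks "no entry" (values are made up)
toy <- matrix(
  c( 8, NA,  7,  9,
    NA, NA,  6, NA,
    10,  9,  8,  7,
     5, NA, NA,  6),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("u", 1:4), paste0("a", 1:4))
)

entries_per_user <- rowSums(!is.na(toy))  # analogue of rowCounts()
entries_per_item <- colSums(!is.na(toy))  # analogue of colCounts()

min_user <- 1  # hypothetical thresholds for the toy data
min_item <- 1

# Strict ">" mirrors the comparison used in the filter above
filtered <- toy[entries_per_user > min_user,
                entries_per_item > min_item,
                drop = FALSE]
dim(filtered)
```

Both counts are computed on the unfiltered matrix, as in the original one-line filter, so a user or item kept here may end up with fewer entries once the other dimension has been trimmed.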

Data Visualization

Exploring the values of the rating

# Vectorize and create unique vector.
vector_ratings <- as.vector(AnimeMatrix@data@x)
unique(vector_ratings)

##  [1] NA  8  7  9 10  2  4  5  6  1  3

# The ratings are in the range 1-10 (plus NA for unrated entries). Let's count the occurrences of each of them.
table_ratings <- table(vector_ratings)
kable(table_ratings)

vector_ratings	Freq
1	2015
2	3081
3	5918
4	14210
5	38643
6	90254
7	191638
8	218259
9	135001
10	83198

table_ratings <- as.data.frame(table(as.vector(AnimeMatrix@data@x)))

ggplot(table_ratings, aes(x = Var1, y = Freq, fill = Var1)) + 
  geom_bar(stat = "identity") + 
  ggtitle("Distribution of Ratings for Anime Items") +
  geom_text(aes(label=Freq), vjust= -0.6, color="black", size=3.5) +
  theme(legend.position="none") + xlab("Rating Score") + ylab("Frequency")

summary(AnimeMatrix@data@x)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00    7.00    8.00    7.61    9.00   10.00  185510

Exploring which animes have been viewed

views_per_anime <- colCounts(AnimeMatrix)

table_views <- data.frame(
anime_names = names(views_per_anime),
views = views_per_anime
)
names(table_views)[names(table_views) == "anime_names"] <- "anime_id"
table_views <- merge(table_views, anime, by="anime_id")
table_views <- table_views[order(table_views$views, decreasing =
TRUE), ]

ggplot(table_views[1:10, ], aes(x = name, y = views)) +
geom_bar(stat="identity") + theme(axis.text.x =
element_text(angle = 45, hjust = 1)) + ggtitle("Number of views of the top 10 animes") + geom_text(aes(label=views), vjust= -0.6, color="black", size=3.5)

Exploring the average ratings

average_ratings <- data.frame("avg_rating" = colMeans(AnimeMatrix)) %>% 
  ggplot(aes(x = avg_rating)) + 
  geom_histogram(color = "black", fill = "lightblue", binwidth = 0.1) + 
  theme( axis.line = element_line(colour = "darkblue", size = 1, linetype = "solid"))+
  ggtitle("Distribution of Average Ratings for Anime Shows")
average_ratings

Data Normalization

anime_Normalization <- normalize(AnimeMatrix)
avg <- round(rowMeans(anime_Normalization), 5)

image(AnimeMatrix[1:100, 1:100], main = "First 100 users and anime items: Top Anime (Non-Normalized)")

image(anime_Normalization[1:100, 1:100], main = "First 100 users and anime items: Top Anime(Normalized)")
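The evaluation output later in this report states that the data was "Normalized using center on rows", meaning each user's mean rating is subtracted from that user's ratings. A minimal base R sketch of that centering on made-up data follows; `toy`, `user_means`, and `centered` are illustrative names.

```r
# Toy user x item matrix; NA marks "not rated" (values are made up)
toy <- matrix(
  c( 8, NA, 10,
     3,  5, NA,
     7,  7,  7),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("u", 1:3), paste0("a", 1:3))
)

# Each user's mean over the items they rated
user_means <- rowMeans(toy, na.rm = TRUE)

# Subtract the row mean from every rating in that row; NA stays NA
centered <- sweep(toy, 1, user_means, "-")

centered
round(rowMeans(centered, na.rm = TRUE), 5)  # row means after centering
```

Centering removes the difference between generous and strict raters, so that a 7 from a user who averages 9 is treated as a below-average score.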

Similarity

Similarity among the first 100 users

sim_user <- similarity(anime_Normalization[1:100, ], method = "cosine", which = "users")
image(as.matrix(sim_user), main = "User Similarity")

Similarity among the first 100 anime items

sim_item <- similarity(anime_Normalization[, 1:100], method = "cosine", which = "items")
image(as.matrix(sim_item), main = "Item Similarity")

Based on the similarity plots, items are more similar to one another than users are to one another.

Recommendation algorithms

Split the dataset into a training set (80%) and a test set (20%), repeated over 4 runs:

# min(rowCounts(anime_Normalization)) = 4, so we can keep 4 items per user
anime_evaluation <- evaluationScheme(data = anime_Normalization, method = "split", train = 0.8, given = 4, goodRating = 5, k = 4) 
anime_evaluation

## Evaluation scheme with 4 items given
## Method: 'split' with 4 run(s).
## Training set proportion: 0.800
## Good ratings: >=5.000000
## Data set: 1843 x 1720 rating matrix of class 'realRatingMatrix' with 967727 ratings.
## Normalized using center on rows.
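The "split" scheme with given = 4 can be pictured as two steps: users are divided into training and test groups, and for each test user a few rated items are exposed to the model while the rest are withheld for scoring. The sketch below imitates that protocol in base R on made-up identifiers; it illustrates the idea and is not the library's implementation.

```r
set.seed(1)  # arbitrary seed, for reproducibility only
n_users <- 10
users   <- paste0("u", seq_len(n_users))

# 80% of users go to training, the remaining 20% to testing
train_users <- sample(users, size = round(0.8 * n_users))
test_users  <- setdiff(users, train_users)

# For one test user, expose `given` rated items and withhold the rest
rated_items   <- paste0("a", 1:9)  # hypothetical items this user rated
given         <- 4
known_items   <- sample(rated_items, given)
unknown_items <- setdiff(rated_items, known_items)

c(train = length(train_users), test = length(test_users))
c(known = length(known_items), unknown = length(unknown_items))
```

The "known" part corresponds to what is passed as newdata when predicting, and the "unknown" part to what the accuracy functions compare against.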

Item-Item Collaborative Filtering

This is a filtering method, where similarity between items is calculated using users’ ratings of items. That means the algorithm recommends items similar to the users’ previous selections. In the algorithm, the similarities between different items are computed by one of the similarity measures, and then similarity values are used to predict ratings for user-item pairs absent in the data.
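To make the two steps concrete, the sketch below computes cosine similarity between item columns of a tiny made-up matrix and then predicts one missing rating as a similarity-weighted average of the user's known ratings. Restricting the similarity to co-rated users and the `cosine_sim` helper are assumptions of this sketch, not a description of the library's internals.

```r
# Cosine similarity between two item columns, using only users who rated both
cosine_sim <- function(x, y) {
  both <- !is.na(x) & !is.na(y)
  if (!any(both)) return(NA_real_)
  sum(x[both] * y[both]) / (sqrt(sum(x[both]^2)) * sqrt(sum(y[both]^2)))
}

# Toy user x item matrix; NA marks "not rated" (values are made up)
toy <- matrix(
  c( 1, 2, NA,
     2, 4,  1,
     3, 6,  5),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("u", 1:3), c("a1", "a2", "a3"))
)

# Item x item similarity matrix
items <- colnames(toy)
sim <- sapply(items, function(i) sapply(items, function(j) cosine_sim(toy[, i], toy[, j])))
round(sim, 3)

# Predict u1's missing rating for a3 from u1's known ratings on a1 and a2
known <- c("a1", "a2")
pred_u1_a3 <- sum(sim["a3", known] * toy["u1", known]) / sum(abs(sim["a3", known]))
pred_u1_a3
```

Here a1 and a2 are equally similar to a3, so the prediction is the plain average of u1's two known ratings.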

Training model

The following step trains the model.

tic("IBCF Model Training")
(model_IBCF <- Recommender(data = getData(anime_evaluation, "train"), method = "IBCF"))

## Warning in .local(x, ...): x was already normalized by row!

## Recommender of type 'IBCF' for 'realRatingMatrix' 
## learned using 1474 users.

toc(log = T, quiet = T)

Examining the Similarity Matrix

similarityMatrix <- getModel(model_IBCF)$sim
which_max <- order(colSums(similarityMatrix > 0), decreasing = TRUE)[1:10]
topAnimes <- as.data.frame(as.integer(rownames(similarityMatrix)[which_max]))
colnames(topAnimes) <- c("anime_id")

topAnimes$anime_id <- as.factor(topAnimes$anime_id)

data <- topAnimes %>% inner_join(anime, by = "anime_id") %>% select(Anime_name = "name")

kable((data)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  row_spec(0, bold = T, color = "white", background = "#fc5e5e") %>%
    scroll_box(width = "100%", height = "200px")

Anime_name
Yume-iro Pâtissière
Inazuma Eleven
Haikyuu!!: Karasuno Koukou VS Shiratorizawa Gakuen Koukou
Shouwa Genroku Rakugo Shinjuu
Koisuru Boukun
Kamikaze Kaitou Jeanne
Kimi no Na wa.
Hanasakeru Seishounen
Mermaid Melody Pichi Pichi Pitch
Gintama°

Predict

tic("IBCF Model Predicting: top 10")
anime_pred <- predict(object = model_IBCF, newdata = getData(anime_evaluation, "known"), n = 10)
toc(log = T, quiet = T)

tic("IBCF Model Predicting: ratings")
anime_predr <- predict(object = model_IBCF, newdata = getData(anime_evaluation, "known"), type = "ratings")
toc(log = T, quiet = T)

# first 8 users recommendations
anime_pred@items[1:8]

## $`226`
##  [1]  173  436  507  993 1059 1107 1390 1453 1525 1569
## 
## $`392`
##  [1]  540  702  868  930  972  991 1193 1341  286  317
## 
## $`478`
##  [1]   65  638  826 1090 1371 1399 1449 1476 1536 1544
## 
## $`804`
## integer(0)
## 
## $`2632`
##  [1]  643  870 1360   43   50  107  114  163  180  222
## 
## $`3009`
##  [1] 477   1  16  18  24 135 171 181 195 217
## 
## $`3117`
##  [1]  15  22 130 132 139 158 213 358 455 464
## 
## $`3338`
##  [1]  68  74 155 262 415 721 724 818 822 844

Because each test user exposes only 4 known ratings (given = 4), the IBCF model sometimes has too little history to recommend any items for a user, as for user 804 above.

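# function to match anime id with names of anime items
# Caution: anime_pred@items holds column positions in the rating matrix, not
# anime_id values, so the join below returns the anime whose anime_id happens
# to equal each position. Translating positions first with
# anime_pred@itemLabels[p] would give the actual recommended titles.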
item_recc <- function(i){
p <- anime_pred@items[[i]]
p <- data.frame("id" = as.factor(p))
p <- inner_join(p, anime, by = c("id" = "anime_id")) %>% select(name, type)
return(as.data.frame(p))
}

for_users <- c(20, 3, 2, 10)
lapply(for_users, item_recc)

## [[1]]
##                                          name  type
## 1                           Appleseed (Movie) Movie
## 2                                     Avenger    TV
## 3        Mobile Suit Gundam: The 08th MS Team   OVA
## 4  Mobile Suit Gundam 0080: War in the Pocket   OVA
## 5                                      X/1999 Movie
## 6                                           X    TV
## 7               Sen to Chihiro no Kamikakushi Movie
## 8                                    Planetes    TV
## 9               InuYasha: Guren no Houraijima Movie
## 10        InuYasha: Kagami no Naka no Mugenjo Movie
## 
## [[2]]
##                                    name  type
## 1                Rozen Maiden: Träumend    TV
## 2                  Mobile Suit Gundam I Movie
## 3                            Hi no Tori    TV
## 4                      Macross 7 Encore   OVA
## 5 City Hunter: Hyakuman Dollar no Inbou   OVA
## 6                          Busou Renkin    TV
## 7   Super Robot Taisen OG The Animation   OVA
## 
## [[3]]
##                                 name  type
## 1  Tenchi Muyou! Ryououki 2nd Season   OVA
## 2                      Slayers Great Movie
## 3              Densetsu Kyojin Ideon    TV
## 4                 Bible Black Gaiden   OVA
## 5                Usagi-chan de Cue!!   OVA
## 6          Happy Seven: The TV Manga    TV
## 7   Violence Jack: Harlem Bomber-hen   OVA
## 8                              B'T X    TV
## 9 Final Fantasy VII: Advent Children Movie
## 
## [[4]]
##                                   name type
## 1                               Naruto   TV
## 2           Kidou Tenshi Angelic Layer   TV
## 3                          Arc the Lad   TV
## 4                              Chobits   TV
## 5          Basilisk: Kouga Ninpou Chou   TV
## 6      Mahou Shoujo Lyrical Nanoha A's   TV
## 7                             Shuffle!   TV
## 8                           Boys Be...   TV
## 9                      Chuuka Ichiban!   TV
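In recommenderlab, a top-N list stores recommendations as positions into its item labels, which is why the numbers printed for the first 8 users never exceed the number of columns in the filtered matrix. The sketch below shows the position-to-id translation on hypothetical stand-ins (`item_labels`, `rec_positions`, `toy_anime`); the ids and titles are taken from the preview tables earlier in this report.

```r
# Hypothetical column labels of a rating matrix (anime_id values as strings)
item_labels <- c("20", "1535", "5114", "9253", "16498")

# A recommendation expressed as column positions, as stored in a top-N list
rec_positions <- c(3, 5, 1)

# Translate positions into anime_id values before looking up titles
rec_ids <- item_labels[rec_positions]

toy_anime <- data.frame(
  anime_id = c("20", "1535", "5114", "9253", "16498"),
  name     = c("Naruto", "Death Note", "Fullmetal Alchemist: Brotherhood",
               "Steins;Gate", "Shingeki no Kyojin"),
  stringsAsFactors = FALSE
)

# match() keeps the order of the recommendation list
rec_names <- toy_anime$name[match(rec_ids, toy_anime$anime_id)]
rec_names
```

Looking up position 3 directly as an anime_id would find nothing in this toy table, or an unrelated title in the full one, which is why the translation step matters.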

Singular Value Decomposition

Please refer to the RPubs link for our detailed explanation of SVD, which we provided in Project 3.

Training

tic ("SVD Model Training")
(model_svd <- Recommender(data = getData(anime_evaluation, "train"), method = "SVD"))

## Warning in .local(x, ...): x was already normalized by row!

## Recommender of type 'SVD' for 'realRatingMatrix' 
## learned using 1474 users.

toc(log = T, quiet = T)

Predict

tic ("SVD Model predicting: top 10")
anime_svd_pred <- predict(object = model_svd, newdata = getData(anime_evaluation, "known"), n = 10) 
toc(log = T, quiet = T)

tic ("SVD Model predicting: ratings")
anime_svd_predr <- predict(object = model_svd, newdata = getData(anime_evaluation, "known"), type = "ratings")
toc(log = T, quiet = T)

# first 8 users recommendations
anime_svd_pred@items[1:8]

## $`226`
##  [1] 1569 1684 1632 1618 1665 1699 1635 1551 1679 1655
## 
## $`392`
##  [1] 1158    9 1505  505  159 1152 1429  133  458 1386
## 
## $`478`
##  [1] 1635 1569 1665 1618 1632 1699 1676 1589 1679 1600
## 
## $`804`
##  [1] 516 429 521 605 537 665 705 644  38 529
## 
## $`2632`
##  [1] 172 285 407 214 254 321  99 217 567 288
## 
## $`3009`
##  [1]  16  18 135 583   1 671 235  21  24 333
## 
## $`3117`
##  [1] 494 587 467   9 159 219 737 761 545 869
## 
## $`3338`
##  [1]  575  669  708 1158 1635  670  495  768  709 1671

Unlike IBCF, the SVD algorithm provided recommendations for every user shown above. SVD is a well-established way to estimate the missing entries of a data matrix.
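The idea of estimating missing entries with SVD can be sketched with base R's svd(): fill the gaps with a simple guess, keep only the strongest latent factors, and read the estimates off the low-rank reconstruction. Filling with column means and keeping k = 2 factors are assumptions of this sketch, and the values are made up.

```r
# Toy user x item matrix; NA marks "not rated" (values are made up)
toy <- matrix(
  c( 9,  8, NA,  2,
     8, NA,  3,  1,
     2,  3,  9, NA,
    NA,  2,  8,  9),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("u", 1:4), paste0("a", 1:4))
)

# Fill missing cells with the item (column) mean so that svd() can run
filled <- toy
for (j in seq_len(ncol(filled))) {
  filled[is.na(filled[, j]), j] <- mean(filled[, j], na.rm = TRUE)
}

k <- 2  # number of latent factors kept (illustrative)
s <- svd(filled)

# Rank-k reconstruction from the k largest singular values
approx_k <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])
dimnames(approx_k) <- dimnames(toy)

# Estimates for the cells that were missing in the original matrix
round(approx_k[is.na(toy)], 2)
```

Because the reconstruction is defined for every cell, every user gets a score for every item, which is consistent with SVD returning a full list for each user above.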

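# function to match anime id with names of anime items
# Caution: anime_svd_pred@items holds column positions in the rating matrix,
# not anime_id values, so the join below returns the anime whose anime_id
# happens to equal each position. Translating positions first with
# anime_svd_pred@itemLabels[p] would give the actual recommended titles.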
svd_recc <- function(i){
p <- anime_svd_pred@items[[i]]
p <- data.frame("id" = as.factor(p))
p <- inner_join(p, anime, by = c("id" = "anime_id")) %>% select(name, type)
return(as.data.frame(p))
}

for_users <- c(20, 3, 2, 10)
lapply(for_users, svd_recc)

## [[1]]
##                                      name    type
## 1 Ghost in the Shell: Stand Alone Complex      TV
## 2                     eX-Driver the Movie   Movie
## 3                                 Mizuiro     OVA
## 4                                Shuffle!      TV
## 5              Girls Bravo: Second Season      TV
## 6                           Buttobi!! CPU     OVA
## 7               Maze☆Bakunetsu Jikuu (TV)      TV
## 8     Pokemon: Senritsu no Mirage Pokemon Special
## 
## [[2]]
##                                            name type
## 1                Ai Shimai 2: Futari no Kajitsu  OVA
## 2                   Otome wa Boku ni Koishiteru   TV
## 3                            Babel Nisei (1992)  OVA
## 4  Soreyuke! Uchuu Senkan Yamamoto Yohko (1999)   TV
## 5                              Shintaisou: Shin  OVA
## 6                                Romeo x Juliet   TV
## 7                            Cosmo Warrior Zero   TV
## 8                                     Bartender   TV
## 9                          Green Green Specials  OVA
## 10                            Galaxy Angel Rune   TV
## 
## [[3]]
##                                                  name    type
## 1               Geobreeders: File-X Chibi Neko Dakkan     OVA
## 2 Detective Conan Movie 09: Strategy Above the Depths   Movie
## 3                          Fushigiboshi no☆Futagohime      TV
## 4                                Boukyaku no Senritsu      TV
## 5                            New Dominion Tank Police     OVA
## 6                         Lupin III: Nusumareta Lupin Special
## 7                                         Green Green      TV
## 8                                       Buttobi!! CPU     OVA
## 9                                        Blood Royale     OVA
## 
## [[4]]
##                                        name  type
## 1                         Kage kara Mamoru!    TV
## 2               Yu☆Gi☆Oh!: Duel Monsters GX    TV
## 3                        Gokinjo Monogatari    TV
## 4                          Tokyo Godfathers Movie
## 5      Grappler Baki: Saidai Tournament-hen    TV
## 6 Bishoujo Senshi Sailor Moon: Sailor Stars    TV
## 7                            Weiß Kreuz OVA   OVA
## 8                              Zetsuai 1989   OVA
## 9                                    Blame!   ONA

Hybrid Recommender

In order to incorporate serendipity, novelty, or diversity we created a hybrid model, where we used the following weights:

50% for IBCF
30% for POPULAR
10% for RERECOMMEND
10% for RANDOM
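The weights above can be read as a weighted blend of each component's scores. The sketch below applies them to hypothetical per-model scores for four candidate items. It illustrates weighted blending in general; the exact combination rule used inside HybridRecommender may differ.

```r
weights <- c(IBCF = 0.5, POPULAR = 0.3, RERECOMMEND = 0.1, RANDOM = 0.1)

# Hypothetical per-model scores (rows = models, columns = candidate items)
scores <- rbind(
  IBCF        = c(a1 = 0.9, a2 = 0.2, a3 = 0.4, a4 = 0.1),
  POPULAR     = c(a1 = 0.3, a2 = 0.8, a3 = 0.5, a4 = 0.2),
  RERECOMMEND = c(a1 = 0.0, a2 = 0.0, a3 = 1.0, a4 = 0.0),
  RANDOM      = c(a1 = 0.6, a2 = 0.1, a3 = 0.7, a4 = 0.9)
)

# The weights are expected to sum to 1
stopifnot(isTRUE(all.equal(sum(weights), 1)))

# Multiply each model's row by its weight, then sum per item
blended <- colSums(scores * weights[rownames(scores)])
ranking <- sort(blended, decreasing = TRUE)
ranking
```

An item that one model scores poorly can still rank well if the other components favor it, which is how the POPULAR and RANDOM components add variety to the IBCF results.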

Training

tic("Hybrid Recommender Training")
model_hybrid <- HybridRecommender(
  Recommender(data = getData(anime_evaluation, "train"), method = "IBCF"),
  Recommender(data = getData(anime_evaluation, "train"), method = "POPULAR"),
  Recommender(data = getData(anime_evaluation, "train"), method = "RERECOMMEND"),
  Recommender(data = getData(anime_evaluation, "train"), method = "RANDOM"),
  weights = c(0.5, 0.3, 0.1, 0.1)   # diversity
)

## Warning in .local(x, ...): x was already normalized by row!

## Warning in .local(x, ...): x was already normalized by row!

toc(log = T, quiet = T)

Predict

tic("Hybrid Recommender Predicting: top 10")
anime_hybrid_pred <- predict(object = model_hybrid, newdata = getData(anime_evaluation, "known"), n = 10) 
toc(log = T, quiet = T)

tic("Hybrid Recommender Predicting: ratings")
anime_hybrid_predr <- predict(object = model_hybrid, newdata = getData(anime_evaluation, "known"), type = "ratings") 
toc(log = T, quiet = T)

# first 8 users recommendations
anime_hybrid_pred@items[1:8]

## $`226`
##  [1] 1038   45  671  996  728  590  226    8 1633  139
## 
## $`392`
##  [1] 1038 1024  996  705  761 1633 1041 1125  529   70
## 
## $`478`
##  [1]  482 1038  858  283  226  154 1633  529  671 1287
## 
## $`804`
##  [1]  333 1708  482  378  558  698 1125 1038 1492  600
## 
## $`2632`
##  [1]  761 1633 1038  698    8 1708  245 1185 1717  378
## 
## $`3009`
##  [1]  616 1229  810   18  724  271  691  477 1347  154
## 
## $`3117`
##  [1] 1038  996  332 1185 1671  590   15  467  558 1161
## 
## $`3338`
##  [1] 1287 1186 1125 1185 1655  627  567  425  285  154

Some of the items recommended by IBCF and SVD also appear in the hybrid recommender's lists.

Let’s see the actual items recommended.

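# function to match anime id with names of anime items
# Caution: anime_hybrid_pred@items holds column positions in the rating matrix,
# not anime_id values, so the join below returns the anime whose anime_id
# happens to equal each position. Translating positions first with
# anime_hybrid_pred@itemLabels[p] would give the actual recommended titles.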
hybrid_recc <- function(i){
p <- anime_hybrid_pred@items[[i]]
p <- data.frame("id" = as.factor(p))
p <- inner_join(p, anime, by = c("id" = "anime_id")) %>% select(name, type)
return(as.data.frame(p))
}

for_users <- c(20, 3, 2, 10)
lapply(for_users, hybrid_recc)

## [[1]]
##                                        name    type
## 1 Naruto: Akaki Yotsuba no Clover wo Sagase Special
## 2 Bishoujo Senshi Sailor Moon: Sailor Stars      TV
## 3                       eX-Driver the Movie   Movie
## 4               Yu☆Gi☆Oh!: Duel Monsters GX      TV
## 5   Haru no Ashioto The Movie: Ourin Dakkan   Movie
## 6                          Shintaisou: Kari     OVA
## 7  Soreyuke! Uchuu Senkan Yamamoto Yohko II     OVA
## 8                   Ojamajo Doremi Dokkaan!      TV
## 
## [[2]]
##                               name  type
## 1      Yu☆Gi☆Oh!: Duel Monsters GX    TV
## 2                          Mizuiro   OVA
## 3              Gunparade Orchestra    TV
## 4                    Akage no Anne    TV
## 5                       Elfen Lied    TV
## 6                      Shaman King    TV
## 7                 Shintaisou: Kari   OVA
## 8             Saishuu Heiki Kanojo    TV
## 9              Lemon Angel Project    TV
## 10   Odin: Koushi Hansen Starlight Movie
## 
## [[3]]
##                                                         name    type
## 1                                                    Mizuiro     OVA
## 2                                             Weiß Kreuz OVA     OVA
## 3                  Bishoujo Senshi Sailor Moon: Sailor Stars      TV
## 4 Chou Henshin Cosprayers vs. Ankoku Uchuu Shougun the Movie   Movie
## 5                  Naruto: Akaki Yotsuba no Clover wo Sagase Special
## 6                                           Shintaisou: Kari     OVA
## 7                                               Zetsuai 1989     OVA
## 8                                       Saishuu Heiki Kanojo      TV
## 
## [[4]]
##                                        name    type
## 1                          Shintaisou: Kari     OVA
## 2                                   Mizuiro     OVA
## 3              Battle Athletess Daiundoukai     OVA
## 4             Mobile Police Patlabor: WXIII   Movie
## 5 Bishoujo Senshi Sailor Moon: Sailor Stars      TV
## 6 Naruto: Akaki Yotsuba no Clover wo Sagase Special
## 7                    Android Ana Maico 2010      TV
## 8                        Petshop of Horrors      TV

Calculating and comparing accuracies

IBCF

anime_item_acc1 <- calcPredictionAccuracy(x = anime_pred, data = getData(anime_evaluation, "unknown"), given = 4, goodRating = 5)
anime_item_acc2 <- calcPredictionAccuracy(x = anime_predr, data = getData(anime_evaluation, "unknown"))

SVD

anime_svd_acc1 <- calcPredictionAccuracy(x = anime_svd_pred, data = getData(anime_evaluation, "unknown"), given = 4, goodRating = 5)
anime_svd_acc2 <- calcPredictionAccuracy(x = anime_svd_predr, data = getData(anime_evaluation, "unknown"))

Hybrid

anime_hy_acc1 <- calcPredictionAccuracy(x = anime_hybrid_pred, data = getData(anime_evaluation, "unknown"), given = 4, goodRating = 5)
anime_hy_acc2 <- calcPredictionAccuracy(x = anime_hybrid_predr, data = getData(anime_evaluation, "unknown"))

TopN

kable(rbind(anime_item_acc1, anime_svd_acc1, anime_hy_acc1)) %>% kable_styling(c("striped", "hover", "bordered"), font_size = 12, full_width = F) %>% add_header_above(c("Recommender", "TopN Accuracy" = 8))

Recommender	TopN Accuracy
	TP	FP	FN	TN	precision	recall	TPR	FPR
anime_item_acc1	0.2710027	8.065041	109.1463	1598.518	0.0382322	0.0055655	0.0055655	0.0048232
anime_svd_acc1	1.3929539	8.607046	108.0244	1597.976	0.1392954	0.0066003	0.0066003	0.0052008
anime_hy_acc1	0.9186992	9.081301	108.4986	1597.501	0.0918699	0.0082115	0.0082115	0.0055893
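The ratio columns in the table are derived from the four confusion counts. A small sketch of those definitions follows, using a hypothetical helper `confusion_metrics` and made-up counts for a single user. The table above reports averages across users, so its ratios need not equal the ratios of its averaged counts.

```r
# Ratios derived from confusion counts for a top-N recommendation list
confusion_metrics <- function(TP, FP, FN, TN) {
  c(
    precision = TP / (TP + FP),  # share of recommended items that were good
    recall    = TP / (TP + FN),  # share of good items that were recommended
    TPR       = TP / (TP + FN),  # true positive rate, identical to recall
    FPR       = FP / (FP + TN)   # share of non-good items that were recommended
  )
}

# Made-up counts: 10 recommendations, 2 of them hits, 20 good items in total
m <- confusion_metrics(TP = 2, FP = 8, FN = 18, TN = 972)
round(m, 4)
```

This also shows why the recall and TPR columns in the table are identical.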

Ratings

kable(rbind(anime_item_acc2, anime_svd_acc2, anime_hy_acc2)) %>% kable_styling(c("striped", "hover", "bordered"), font_size = 12, full_width = F) %>% add_header_above(c("Recommender", "Ratings Accuracy" = 3))

Recommender	Ratings Accuracy
	RMSE	MSE	MAE
anime_item_acc2	1.546860	2.392774	1.094808
anime_svd_acc2	1.673719	2.801336	1.224757
anime_hy_acc2	1.595474	2.545536	1.160603
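The three error measures compare predicted ratings with held-out ratings. The sketch below computes them on made-up vectors, where `actual` and `predicted` are illustrative, and then checks that the IBCF row above satisfies MSE = RMSE squared.

```r
actual    <- c(8, 7, 10, 5, 9)     # made-up held-out ratings
predicted <- c(7.5, 8, 9, 6.5, 9)  # made-up model predictions

err  <- predicted - actual
mse  <- mean(err^2)    # mean squared error
rmse <- sqrt(mse)      # root mean squared error, in rating units
mae  <- mean(abs(err)) # mean absolute error

c(RMSE = rmse, MSE = mse, MAE = mae)

# Consistency check on the IBCF row of the table above
ibcf_consistent <- isTRUE(all.equal(1.546860^2, 2.392774, tolerance = 1e-5))
ibcf_consistent
```

RMSE penalizes large misses more heavily than MAE, so a model with a few badly wrong predictions can rank worse on RMSE than on MAE.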

ROC Curve

models_evaluate <- list(
  IBCF = list(name = "IBCF", param = list(method = "cosine")),
  SVD = list(name = "SVD", param = list(k = 30)),
  POPULAR = list(name = "POPULAR", param = NULL),
  RANDOM = list(name = "RANDOM", param = NULL)
)
results <- evaluate(anime_evaluation, method = models_evaluate, n = c(1, 3, 5, 15, 20))

## IBCF run fold/sample [model time/prediction time]
##   1

## Warning in .local(x, ...): x was already normalized by row!

## [25.6sec/0.08sec] 
##   2

## Warning in .local(x, ...): x was already normalized by row!

## [24.32sec/0.08sec] 
##   3

## Warning in .local(x, ...): x was already normalized by row!

## [24.78sec/0.06sec] 
##   4

## Warning in .local(x, ...): x was already normalized by row!

## [24.08sec/0.08sec] 
## SVD run fold/sample [model time/prediction time]
##   1

## Warning in .local(x, ...): x was already normalized by row!

## [0.66sec/0.27sec] 
##   2

## Warning in .local(x, ...): x was already normalized by row!

## [0.63sec/0.28sec] 
##   3

## Warning in .local(x, ...): x was already normalized by row!

## [0.85sec/0.28sec] 
##   4

## Warning in .local(x, ...): x was already normalized by row!

## [0.64sec/0.28sec] 
## POPULAR run fold/sample [model time/prediction time]
##   1

## Warning in .local(x, ...): x was already normalized by row!

## [0.02sec/1.3sec] 
##   2

## Warning in .local(x, ...): x was already normalized by row!

## [0.01sec/1.31sec] 
##   3

## Warning in .local(x, ...): x was already normalized by row!

## [0.01sec/1.31sec] 
##   4

## Warning in .local(x, ...): x was already normalized by row!

## [0.02sec/1.31sec] 
## RANDOM run fold/sample [model time/prediction time]
##   1  [0sec/0.32sec] 
##   2  [0sec/0.32sec] 
##   3  [0sec/0.33sec] 
##   4  [0.02sec/0.35sec]

plot(results, annotate = T, legend = "topleft") 
title("ROC Curve")

Precision-Recall

plot(results, "prec/rec", annotate = T, legend = "bottomright")
title("Precision-Recall")

Runtime comparison

# Run time comparison
log <- as.data.frame(unlist(tic.log(format = TRUE)))
colnames(log) <- c("Run Time")
knitr::kable(log, format = "html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))

Run Time
IBCF Model Training: 25.76 sec elapsed
IBCF Model Predicting: top 10: 0.11 sec elapsed
IBCF Model Predicting: ratings: 0.06 sec elapsed
SVD Model Training: 0.39 sec elapsed
SVD Model predicting: top 10: 0.32 sec elapsed
SVD Model predicting: ratings: 0.21 sec elapsed
Hybrid Recommender Training: 26.37 sec elapsed
Hybrid Recommender Predicting: top 10: 6.55 sec elapsed
Hybrid Recommender Predicting: ratings: 6.38 sec elapsed

Summary

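Low error combined with low runtime indicates good performance, and by that standard no single model is best on every measure. In the accuracy tables, IBCF has the lowest rating error (RMSE 1.55), SVD has the highest top-N precision (0.139), and the hybrid has the highest recall (0.0082), with error and precision that fall between the other two. In the runtime table, SVD is by far the fastest to train (0.39 sec), while the hybrid is the slowest both to train (26.37 sec) and to predict (about 6.5 sec). The hybrid is therefore a compromise: by blending several recommenders, each component makes up for some of the shortcomings of the others, at the cost of runtime. We also noted that there are problems recommending items to new users because of the lack of historical data.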

Take home from this course

In our day-to-day lives, we use recommenders such as those on Amazon.com and Netflix.com, but we didn’t know about the underlying algorithms. This course offered an opportunity to learn them through a variety of exercises.

Sources

https://myanimelist.net/

https://towardsdatascience.com/building-predictive-models-with-myanimelist-and-sklearn-54edc6c9fff3

https://www.kaggle.com/CooperUnion/anime-recommendations-database
