Movie Recommender System with Large Dataset
Objectives
The goal for your final project is to build out a recommender system using a large dataset (e.g., 1M+ ratings, or 10k+ users and 10k+ items). There are three deliverables, with separate due dates:
[1] Planning Document. Find an interesting dataset and describe the system you plan to build out. If you would like to use one of the datasets you have already worked with, you should add a unique element or incorporate additional data (e.g., explicit features you scrape from another source, like image analysis on movie posters). The overall goal, however, will be to produce quality recommendations by extracting insights from a large dataset. You may do so using Spark, or another distributed computing method, OR by effectively applying one of the more advanced mathematical techniques we have covered. There is no preference for one over the other, as long as your recommender works! The planning document should be written up and published as a notebook on GitHub or in RPubs. Please submit the link in the Unit 4 folder, due Thursday, July 5.
[2] Presentation. Make a five-minute presentation of your system in our final meetup on Tuesday. If you’re not able to attend the meetup, you’re responsible for either recording your presentation, or scheduling one-on-one time to deliver your presentation prior to the meetup. You should be prepared to present on Tuesday. You should use this project to showcase some of the concepts that you have learned in this course, while delivering on the (probably) less familiar Spark platform. You are welcome to submit a compelling alternative proposal (subject to approval), such as implementing a recommender system in Microsoft Azure ML Studio or with Google TensorFlow, or building out an application of a certain complexity using another tool. You may work in a small group (2-3) on this assignment.
[3] Implementation. In this final project deliverable, you’ll build out the system that you describe in your planning document. This will be due on Thursday and must be turned in as an RMarkdown file or a Jupyter notebook, and posted to GitHub or RPubs.com.
Libraries
Preamble
In our proposal, we said that we would use the full MovieLens dataset from the “recommended for education and development” section of https://grouplens.org/datasets/movielens/. However, the dataset was so large that R ran out of memory while creating the rating matrix. The error message is shown below.
Error: cannot allocate vector of size 113.7 Gb
So, we changed our plan and obtained similar but smaller data from Kaggle, which still fulfills the minimum requirements of the project. The Kaggle data is described below.
Data
MyAnimeList, often abbreviated as MAL, is an anime and manga social networking and social cataloging application website. The site provides its users with a list-like system to organize and score anime and manga. It facilitates finding users who share similar tastes and provides a large database on anime and manga. In 2018, MyAnimeList reported having approximately 15,000 anime and 45,000 manga entries. In 2015, the site received 120 million visitors a month.
We gathered the data from Kaggle, which provides two CSV files. The data are described as follows:
This data set contains information on user preference data from 73,516 users on 12,294 anime. Each user is able to add anime to their completed list and give it a rating and this data set is a compilation of those ratings.
Anime.csv
- anime_id - myanimelist.net’s unique id identifying an anime.
- name - full name of anime.
- genre - comma separated list of genres for this anime.
- type - movie, TV, OVA, etc.
- episodes - how many episodes in this show. (1 if movie).
- rating - average rating out of 10 for this anime.
- members - number of community members that are in this anime’s “group”.
Rating.csv
- user_id - non identifiable randomly generated user id.
- anime_id - the anime that this user has rated.
- rating - rating out of 10 this user has assigned (-1 if the user watched it but didn’t assign a rating).
Load Data
Preview data
#Preview anime data
kable(head(anime, n = 10L)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
row_spec(0, bold = T, color = "white", background = "#fc5e5e") %>%
scroll_box(width = "100%", height = "200px")
anime_id | name | genre | type | episodes | rating | members |
---|---|---|---|---|---|---|
32281 | Kimi no Na wa. | Drama, Romance, School, Supernatural | Movie | 1 | 9.37 | 200630 |
5114 | Fullmetal Alchemist: Brotherhood | Action, Adventure, Drama, Fantasy, Magic, Military, Shounen | TV | 64 | 9.26 | 793665 |
28977 | Gintama° | Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen | TV | 51 | 9.25 | 114262 |
9253 | Steins;Gate | Sci-Fi, Thriller | TV | 24 | 9.17 | 673572 |
9969 | Gintama' | Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen | TV | 51 | 9.16 | 151266 |
32935 | Haikyuu!!: Karasuno Koukou VS Shiratorizawa Gakuen Koukou | Comedy, Drama, School, Shounen, Sports | TV | 10 | 9.15 | 93351 |
11061 | Hunter x Hunter (2011) | Action, Adventure, Shounen, Super Power | TV | 148 | 9.13 | 425855 |
820 | Ginga Eiyuu Densetsu | Drama, Military, Sci-Fi, Space | OVA | 110 | 9.11 | 80679 |
15335 | Gintama Movie: Kanketsu-hen - Yorozuya yo Eien Nare | Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen | Movie | 1 | 9.10 | 72534 |
15417 | Gintama': Enchousen | Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen | TV | 13 | 9.11 | 81109 |
## Rows: 12,294
## Columns: 7
## $ anime_id <int> 32281, 5114, 28977, 9253, 9969, 32935, 11061, 820, 15335, ...
## $ name <fct> Kimi no Na wa., Fullmetal Alchemist: Brotherhood, Gintama°...
## $ genre <fct> "Drama, Romance, School, Supernatural", "Action, Adventure...
## $ type <fct> Movie, TV, TV, TV, TV, TV, TV, OVA, Movie, TV, TV, Movie, ...
## $ episodes <fct> 1, 64, 51, 24, 51, 10, 148, 110, 1, 13, 24, 1, 201, 25, 25...
## $ rating <dbl> 9.37, 9.26, 9.25, 9.17, 9.16, 9.15, 9.13, 9.11, 9.10, 9.11...
## $ members <int> 200630, 793665, 114262, 673572, 151266, 93351, 425855, 806...
#Preview ratings data
kable(head(anime_ratings, n = 10L)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
row_spec(0, bold = T, color = "white", background = "#fc5e5e") %>%
scroll_box(width = "100%", height = "200px")
user_id | anime_id | rating |
---|---|---|
1 | 20 | -1 |
1 | 24 | -1 |
1 | 79 | -1 |
1 | 226 | -1 |
1 | 241 | -1 |
1 | 355 | -1 |
1 | 356 | -1 |
1 | 442 | -1 |
1 | 487 | -1 |
1 | 846 | -1 |
## Rows: 7,813,737
## Columns: 3
## $ user_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ anime_id <int> 20, 24, 79, 226, 241, 355, 356, 442, 487, 846, 936, 1546, ...
## $ rating <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1...
Clean the Data
As mentioned above, if a user watched an anime but didn’t assign a rating, the corresponding field contains -1. So, we converted these unrated entries to NA and changed the data types for downstream analysis.
anime_ratings$rating[anime_ratings$rating == -1] <- NA # -1 if the user watched it but didn't assign a rating
anime_sp <- anime_ratings
anime_ratings$user_id <- as.factor(anime_ratings$user_id)
anime_ratings$anime_id <- as.factor(anime_ratings$anime_id)
anime$anime_id <- as.factor(anime$anime_id)
anime$name <- as.character(anime$name)
anime$type <- as.character(anime$type)
anime$genre <- as.character(anime$genre)
## Rows: 7,813,737
## Columns: 3
## $ user_id <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ anime_id <fct> 20, 24, 79, 226, 241, 355, 356, 442, 487, 846, 936, 1546, ...
## $ rating <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## Rows: 12,294
## Columns: 7
## $ anime_id <fct> 32281, 5114, 28977, 9253, 9969, 32935, 11061, 820, 15335, ...
## $ name <chr> "Kimi no Na wa.", "Fullmetal Alchemist: Brotherhood", "Gin...
## $ genre <chr> "Drama, Romance, School, Supernatural", "Action, Adventure...
## $ type <chr> "Movie", "TV", "TV", "TV", "TV", "TV", "TV", "OVA", "Movie...
## $ episodes <fct> 1, 64, 51, 24, 51, 10, 148, 110, 1, 13, 24, 1, 201, 25, 25...
## $ rating <dbl> 9.37, 9.26, 9.25, 9.17, 9.16, 9.15, 9.13, 9.11, 9.10, 9.11...
## $ members <int> 200630, 793665, 114262, 673572, 151266, 93351, 425855, 806...
Data Exploration
Highest rated animes
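# Note: top_n(10) without `wt` selects by the last column (members), so this
# returns the 10 anime with the most members, ordered by rating.
anime %>% arrange(desc(rating)) %>%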
top_n(10) %>% kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
row_spec(0, bold = T, color = "white", background = "#fc5e5e") %>%
scroll_box(width = "100%", height = "200px")
## Selecting by members
anime_id | name | genre | type | episodes | rating | members |
---|---|---|---|---|---|---|
5114 | Fullmetal Alchemist: Brotherhood | Action, Adventure, Drama, Fantasy, Magic, Military, Shounen | TV | 64 | 9.26 | 793665 |
9253 | Steins;Gate | Sci-Fi, Thriller | TV | 24 | 9.17 | 673572 |
1575 | Code Geass: Hangyaku no Lelouch | Action, Mecha, Military, School, Sci-Fi, Super Power | TV | 25 | 8.83 | 715151 |
1535 | Death Note | Mystery, Police, Psychological, Supernatural, Thriller | TV | 37 | 8.71 | 1013917 |
16498 | Shingeki no Kyojin | Action, Drama, Fantasy, Shounen, Super Power | TV | 25 | 8.54 | 896229 |
4224 | Toradora! | Comedy, Romance, School, Slice of Life | TV | 25 | 8.45 | 633817 |
6547 | Angel Beats! | Action, Comedy, Drama, School, Supernatural | TV | 13 | 8.39 | 717796 |
10620 | Mirai Nikki (TV) | Action, Mystery, Psychological, Shounen, Supernatural, Thriller | TV | 26 | 8.07 | 657190 |
11757 | Sword Art Online | Action, Adventure, Fantasy, Game, Romance | TV | 25 | 7.83 | 893100 |
20 | Naruto | Action, Comedy, Martial Arts, Shounen, Super Power | TV | 220 | 7.81 | 683297 |
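To rank strictly by rating, the ranking column can be named explicitly. Below is a minimal base-R sketch on four rows copied from the tables in this report; the `toy` data frame and the `top_by` helper are illustrative names, not part of the original analysis.

```r
# Four rows copied from the anime table; `toy` and `top_by` are hypothetical names.
toy <- data.frame(
  name    = c("Kimi no Na wa.", "Fullmetal Alchemist: Brotherhood",
              "Death Note", "Naruto"),
  rating  = c(9.37, 9.26, 8.71, 7.81),
  members = c(200630, 793665, 1013917, 683297),
  stringsAsFactors = FALSE
)

# Return the n rows with the largest values in the named column.
top_by <- function(df, column, n) {
  head(df[order(df[[column]], decreasing = TRUE), ], n)
}

top_rated   <- top_by(toy, "rating", 2)   # ranked by average rating
top_members <- top_by(toy, "members", 2)  # ranked by community size

print(top_rated$name)
print(top_members$name)
```

The two rankings differ even on this small excerpt. With dplyr, the same intent could be expressed by passing the column through `wt`, as in `top_n(10, wt = rating)`.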
Most common type of show
anime %>% count(type)%>%
ggplot(aes(x = type, y = n)) +
geom_bar(stat = "identity", fill = "light blue" ) +
geom_text(aes(label=n), vjust= -0.6, color="black", size=3.5) +
theme_minimal()
Anime with the most members
anime %>% arrange(desc(members)) %>%
top_n(10) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
row_spec(0, bold = T, color = "white", background = "#fc5e5e") %>%
scroll_box(width = "100%", height = "200px")
## Selecting by members
anime_id | name | genre | type | episodes | rating | members |
---|---|---|---|---|---|---|
1535 | Death Note | Mystery, Police, Psychological, Supernatural, Thriller | TV | 37 | 8.71 | 1013917 |
16498 | Shingeki no Kyojin | Action, Drama, Fantasy, Shounen, Super Power | TV | 25 | 8.54 | 896229 |
11757 | Sword Art Online | Action, Adventure, Fantasy, Game, Romance | TV | 25 | 7.83 | 893100 |
5114 | Fullmetal Alchemist: Brotherhood | Action, Adventure, Drama, Fantasy, Magic, Military, Shounen | TV | 64 | 9.26 | 793665 |
6547 | Angel Beats! | Action, Comedy, Drama, School, Supernatural | TV | 13 | 8.39 | 717796 |
1575 | Code Geass: Hangyaku no Lelouch | Action, Mecha, Military, School, Sci-Fi, Super Power | TV | 25 | 8.83 | 715151 |
20 | Naruto | Action, Comedy, Martial Arts, Shounen, Super Power | TV | 220 | 7.81 | 683297 |
9253 | Steins;Gate | Sci-Fi, Thriller | TV | 24 | 9.17 | 673572 |
10620 | Mirai Nikki (TV) | Action, Mystery, Psychological, Shounen, Supernatural, Thriller | TV | 26 | 8.07 | 657190 |
4224 | Toradora! | Comedy, Romance, School, Slice of Life | TV | 25 | 8.45 | 633817 |
Create Matrix
## 73515 x 11200 rating matrix of class 'realRatingMatrix' with 7813730 ratings.
## [1] 73515 11200
## 99233736 bytes
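The sizes reported above show why a sparse representation matters. As a rough check, the sketch below compares the reported object size with what the same matrix would need if every cell were stored as an 8-byte double. The variable names are ours, and the 8-bytes-per-cell figure is an assumption about a plain dense numeric matrix.

```r
# Dimensions and counts copied from the output above.
n_users      <- 73515
n_items      <- 11200
n_ratings    <- 7813730
sparse_bytes <- 99233736          # object size reported above

# Assumption: a dense numeric matrix stores 8 bytes per cell.
dense_bytes <- n_users * n_items * 8
dense_gib   <- dense_bytes / 2^30

# Share of user-item cells that actually hold an entry.
density <- n_ratings / (n_users * n_items)

cat(sprintf("dense: %.2f GiB, sparse: %.1f MiB, filled cells: %.2f%%\n",
            dense_gib, sparse_bytes / 2^20, 100 * density))
```

Under that assumption the dense form needs roughly 6 GiB, while fewer than 1% of the cells hold an entry.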
Selecting the most relevant data
On exploring the data, we noticed that the table contains:
- Ratings of anime that have been viewed only a few times, which might be biased. So, we keep only anime that have been watched at least 1000 times.
- Ratings from users who rated only a few anime, which might be biased too. So, we keep only users who have rated at least 500 anime shows.
## 1843 x 1720 rating matrix of class 'realRatingMatrix' with 967727 ratings.
## 11850056 bytes
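The filtering chunk itself is not shown in this report. The idea can be sketched in base R on a toy matrix, with thresholds of 2 standing in for the 1000 and 500 used above; the matrix, names, and thresholds are all illustrative assumptions.

```r
# Toy user x item rating matrix; NA means "not rated". All values are made up.
ratings <- matrix(
  c( 8, NA,  7, NA,
     9,  6, NA, NA,
    NA,  5,  8, NA,
    NA, NA, NA,  3),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("u", 1:4), paste0("a", 1:4))
)

min_item_ratings <- 2   # stand-in for the per-anime threshold
min_user_ratings <- 2   # stand-in for the per-user threshold

# Count the non-missing entries per column (item) and per row (user).
keep_items <- colSums(!is.na(ratings)) >= min_item_ratings
keep_users <- rowSums(!is.na(ratings)) >= min_user_ratings

# Both filters use counts from the unfiltered matrix and are applied together.
filtered <- ratings[keep_users, keep_items, drop = FALSE]
print(filtered)
```

Because the counts are taken before filtering, a user who passed the threshold may end up with fewer ratings once sparse items are removed.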
Data Visualization
Exploring the values of the rating
# Vectorize and create unique vector.
vector_ratings <- as.vector(AnimeMatrix@data@x)
unique(vector_ratings)
## [1] NA 8 7 9 10 2 4 5 6 1 3
# The ratings are in the range 1-10 (plus NA for unrated). Let's count the occurrences of each of them.
table_ratings <- table(vector_ratings)
kable(table_ratings)
vector_ratings | Freq |
---|---|
1 | 2015 |
2 | 3081 |
3 | 5918 |
4 | 14210 |
5 | 38643 |
6 | 90254 |
7 | 191638 |
8 | 218259 |
9 | 135001 |
10 | 83198 |
table_ratings <- as.data.frame(table(as.vector(AnimeMatrix@data@x)))
ggplot(table_ratings, aes(x = Var1, y = Freq, fill = Var1)) +
geom_bar(stat = "identity") +
ggtitle("Distribution of Ratings for Anime Items") +
geom_text(aes(label=Freq), vjust= -0.6, color="black", size=3.5) +
theme(legend.position="none") + xlab("Rating Score") + ylab("Frequency")
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 7.00 8.00 7.61 9.00 10.00 185510
Exploring which animes have been viewed
table_views <- data.frame(
anime_names = names(views_per_anime),
views = views_per_anime
)
names(table_views)[names(table_views) == "anime_names"] <- "anime_id"
table_views <- merge(table_views, anime, by="anime_id")
table_views <- table_views[order(table_views$views, decreasing = TRUE), ]
ggplot(table_views[1:10, ], aes(x = name, y = views)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle("Number of views of the top 10 animes") +
geom_text(aes(label = views), vjust = -0.6, color = "black", size = 3.5)
Exploring the average ratings
average_ratings <- data.frame("avg_rating" = colMeans(AnimeMatrix)) %>%
ggplot(aes(x = avg_rating)) +
geom_histogram(color = "black", fill = "lightblue", binwidth = 0.1) +
theme( axis.line = element_line(colour = "darkblue", size = 1, linetype = "solid"))+
ggtitle("Distribution of Average Ratings for Anime Shows")
average_ratings
Data Normalization
Similarity
Similarity among the first 100 users
sim_user <- similarity(anime_Normalization[1:100, ], method = "cosine", which = "users")
image(as.matrix(sim_user), main = "User Similarity")
Similarity among the first 100 anime items
sim_item <- similarity(anime_Normalization[, 1:100], method = "cosine", which = "items")
image(as.matrix(sim_item), main = "Item Similarity")
Based on the similarity plots, items are more similar to one another than users are.
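To make the two steps behind these plots concrete, here is a minimal base-R sketch of row centering followed by cosine similarity between users. The toy matrix and the `cosine_sim` helper are illustrative. Restricting the computation to co-rated items is our simplifying assumption and may differ from how the similarity function used above treats missing values.

```r
# Toy user x item ratings; NA means "not rated". All values are made up.
ratings <- matrix(
  c(8,  6, NA,
    9,  7,  5,
    2, NA,  9),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("u1", "u2", "u3"), c("a1", "a2", "a3"))
)

# Center each user's ratings on that user's own mean (row centering).
centered <- ratings - rowMeans(ratings, na.rm = TRUE)

# Cosine similarity over the items both users have rated.
cosine_sim <- function(x, y) {
  both <- !is.na(x) & !is.na(y)
  sum(x[both] * y[both]) / sqrt(sum(x[both]^2) * sum(y[both]^2))
}

sim_u1_u2 <- cosine_sim(centered["u1", ], centered["u2", ])
sim_u1_u3 <- cosine_sim(centered["u1", ], centered["u3", ])
print(c(u1_u2 = sim_u1_u2, u1_u3 = sim_u1_u3))
```

After centering, a positive value means two users deviate from their own averages in the same direction on the shared items, and a negative value means they deviate in opposite directions.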
Recommendation algorithms
Split the dataset into training set (80%) and testing set (20%):
# min(rowCounts(anime_Normalization)) = 4, so we can keep 4 items per user
anime_evaluation <- evaluationScheme(data = anime_Normalization, method = "split", train = 0.8, given = 4, goodRating = 5, k = 4)
anime_evaluation
## Evaluation scheme with 4 items given
## Method: 'split' with 4 run(s).
## Training set proportion: 0.800
## Good ratings: >=5.000000
## Data set: 1843 x 1720 rating matrix of class 'realRatingMatrix' with 967727 ratings.
## Normalized using center on rows.
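The split can be pictured with a small base-R sketch: 80% of the users go to training, and for each test user `given` ratings are revealed to the model while the rest are withheld for scoring. The seed, the use of `floor()`, and the toy list of rated items are assumptions made for illustration.

```r
set.seed(42)            # arbitrary seed, only for reproducibility of the sketch
n_users     <- 1843     # users in the filtered matrix above
train_share <- 0.8
given       <- 4

# Shuffle the user indices and cut at the training share.
users       <- sample(n_users)
n_train     <- floor(train_share * n_users)
train_users <- users[seq_len(n_train)]
test_users  <- users[-seq_len(n_train)]

# For one hypothetical test user: reveal `given` rated items, hide the rest.
rated_items   <- c(3, 17, 42, 58, 99, 120, 301)
known_items   <- sample(rated_items, given)
unknown_items <- setdiff(rated_items, known_items)

cat(length(train_users), "training users,", length(test_users), "test users\n")
```

With these numbers the training set holds 1474 users, which agrees with the "learned using 1474 users" messages printed by the models below.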
Item-Item Collaborative Filtering
This is a filtering method, where similarity between items is calculated using users’ ratings of items. That means the algorithm recommends items similar to the users’ previous selections. In the algorithm, the similarities between different items are computed by one of the similarity measures, and then similarity values are used to predict ratings for user-item pairs absent in the data.
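As a rough illustration of the prediction step, the sketch below estimates a missing rating as a similarity-weighted average of the user's existing ratings. The similarity values, item names, and the `predict_item` helper are hypothetical, and the weighted-average form is a common textbook variant rather than a description of the library's exact implementation.

```r
items <- c("a1", "a2", "a3")

# Hypothetical item-item similarity matrix (symmetric, 1 on the diagonal).
sim <- matrix(
  c(1.0, 0.8, 0.2,
    0.8, 1.0, 0.4,
    0.2, 0.4, 1.0),
  nrow = 3, byrow = TRUE, dimnames = list(items, items)
)

# One user's ratings; a2 is the item we want to predict.
user_ratings <- c(a1 = 9, a2 = NA, a3 = 4)

# Weighted average of the user's known ratings, weighted by similarity to the target.
predict_item <- function(target, user_ratings, sim) {
  rated   <- names(user_ratings)[!is.na(user_ratings)]
  weights <- sim[target, rated]
  sum(weights * user_ratings[rated]) / sum(abs(weights))
}

pred_a2 <- predict_item("a2", user_ratings, sim)
print(pred_a2)
```

Because `a2` is much more similar to `a1` than to `a3`, the estimate lands closer to the user's rating of `a1`.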
Training model
In the step below, we train the model.
tic("IBCF Model Training")
(model_IBCF <- Recommender(data = getData(anime_evaluation, "train"), method = "IBCF"))
toc(log = T, quiet = T)
## Warning in .local(x, ...): x was already normalized by row!
## Recommender of type 'IBCF' for 'realRatingMatrix'
## learned using 1474 users.
Examining the Similarity Matrix
similarityMatrix <- getModel(model_IBCF)$sim
which_max <- order(colSums(similarityMatrix > 0), decreasing = TRUE)[1:10]
topAnimes <- as.data.frame(as.integer(rownames(similarityMatrix)[which_max]))
colnames(topAnimes) <- c("anime_id")
topAnimes$anime_id <- as.factor(topAnimes$anime_id)
data <- topAnimes %>% inner_join(anime, by = "anime_id") %>% select(Anime_name = "name")
kable((data)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
row_spec(0, bold = T, color = "white", background = "#fc5e5e") %>%
scroll_box(width = "100%", height = "200px")
Anime_name |
---|
Yume-iro Pâtissière |
Inazuma Eleven |
Haikyuu!!: Karasuno Koukou VS Shiratorizawa Gakuen Koukou |
Shouwa Genroku Rakugo Shinjuu |
Koisuru Boukun |
Kamikaze Kaitou Jeanne |
Kimi no Na wa. |
Hanasakeru Seishounen |
Mermaid Melody Pichi Pichi Pitch |
Gintama° |
Predict
tic("IBCF Model Predicting: top 10")
anime_pred <- predict(object = model_IBCF, newdata = getData(anime_evaluation, "known"), n = 10)
toc(log = T, quiet = T)
tic("IBCF Model Predicting: ratings")
anime_predr <- predict(object = model_IBCF, newdata = getData(anime_evaluation, "known"), type = "ratings")
toc(log = T, quiet = T)
## $`226`
## [1] 173 436 507 993 1059 1107 1390 1453 1525 1569
##
## $`392`
## [1] 540 702 868 930 972 991 1193 1341 286 317
##
## $`478`
## [1] 65 638 826 1090 1371 1399 1449 1476 1536 1544
##
## $`804`
## integer(0)
##
## $`2632`
## [1] 643 870 1360 43 50 107 114 163 180 222
##
## $`3009`
## [1] 477 1 16 18 24 135 171 181 195 217
##
## $`3117`
## [1] 15 22 130 132 139 158 213 358 455 464
##
## $`3338`
## [1] 68 74 155 262 415 721 724 818 822 844
Due to lack of historical data, sometimes the IBCF model may not recommend any items for one or more users.
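# function to match recommended items with names of anime items
# Caveat: @items holds column positions in the rating matrix, not anime_id values,
# so this join matches positions against ids; mapping through anime_pred@itemLabels first would avoid that.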
item_recc <- function(i){
p <- anime_pred@items[[i]]
p <- data.frame("id" = as.factor(p))
p <- inner_join(p, anime, by = c("id" = "anime_id")) %>% select(name, type)
return(as.data.frame(p))
}
## [[1]]
## name type
## 1 Appleseed (Movie) Movie
## 2 Avenger TV
## 3 Mobile Suit Gundam: The 08th MS Team OVA
## 4 Mobile Suit Gundam 0080: War in the Pocket OVA
## 5 X/1999 Movie
## 6 X TV
## 7 Sen to Chihiro no Kamikakushi Movie
## 8 Planetes TV
## 9 InuYasha: Guren no Houraijima Movie
## 10 InuYasha: Kagami no Naka no Mugenjo Movie
##
## [[2]]
## name type
## 1 Rozen Maiden: Träumend TV
## 2 Mobile Suit Gundam I Movie
## 3 Hi no Tori TV
## 4 Macross 7 Encore OVA
## 5 City Hunter: Hyakuman Dollar no Inbou OVA
## 6 Busou Renkin TV
## 7 Super Robot Taisen OG The Animation OVA
##
## [[3]]
## name type
## 1 Tenchi Muyou! Ryououki 2nd Season OVA
## 2 Slayers Great Movie
## 3 Densetsu Kyojin Ideon TV
## 4 Bible Black Gaiden OVA
## 5 Usagi-chan de Cue!! OVA
## 6 Happy Seven: The TV Manga TV
## 7 Violence Jack: Harlem Bomber-hen OVA
## 8 B'T X TV
## 9 Final Fantasy VII: Advent Children Movie
##
## [[4]]
## name type
## 1 Naruto TV
## 2 Kidou Tenshi Angelic Layer TV
## 3 Arc the Lad TV
## 4 Chobits TV
## 5 Basilisk: Kouga Ninpou Chou TV
## 6 Mahou Shoujo Lyrical Nanoha A's TV
## 7 Shuffle! TV
## 8 Boys Be... TV
## 9 Chuuka Ichiban! TV
Singular Value Decomposition
Please refer to the RPubs link for our detailed explanation of SVD, which we provided in Project 3.
Training
tic ("SVD Model Training")
(model_svd <- Recommender(data = getData(anime_evaluation, "train"), method = "SVD"))
toc(log = T, quiet = T)
## Warning in .local(x, ...): x was already normalized by row!
## Recommender of type 'SVD' for 'realRatingMatrix'
## learned using 1474 users.
Predict
tic ("SVD Model predicting: top 10")
anime_svd_pred <- predict(object = model_svd, newdata = getData(anime_evaluation, "known"), n = 10)
toc(log = T, quiet = T)
tic ("SVD Model predicting: ratings")
anime_svd_predr <- predict(object = model_svd, newdata = getData(anime_evaluation, "known"), type = "ratings")
toc(log = T, quiet = T)
## $`226`
## [1] 1569 1684 1632 1618 1665 1699 1635 1551 1679 1655
##
## $`392`
## [1] 1158 9 1505 505 159 1152 1429 133 458 1386
##
## $`478`
## [1] 1635 1569 1665 1618 1632 1699 1676 1589 1679 1600
##
## $`804`
## [1] 516 429 521 605 537 665 705 644 38 529
##
## $`2632`
## [1] 172 285 407 214 254 321 99 217 567 288
##
## $`3009`
## [1] 16 18 135 583 1 671 235 21 24 333
##
## $`3117`
## [1] 494 587 467 9 159 219 737 761 545 869
##
## $`3338`
## [1] 575 669 708 1158 1635 670 495 768 709 1671
Unlike IBCF, the SVD algorithm provides recommendations for every user. SVD is a well-established way to estimate missing entries in a data matrix.
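How a truncated SVD can estimate missing entries is sketched below in base R: missing cells are first filled with column means, and the matrix is then rebuilt from its top `k` singular values. The toy matrix, `k = 2`, and the column-mean imputation are assumptions chosen for illustration, not a reproduction of the fitted model above.

```r
# Toy user x item ratings; NA means "not rated". All values are made up.
ratings <- matrix(
  c( 8,  7, NA,  2,
     9, NA,  3,  1,
    NA,  8,  2,  2,
     3,  2,  9, NA),
  nrow = 4, byrow = TRUE
)

# Step 1: fill each missing cell with its column (item) mean.
missing   <- is.na(ratings)
col_means <- colMeans(ratings, na.rm = TRUE)
filled    <- ratings
filled[missing] <- col_means[col(filled)[missing]]

# Step 2: keep only the k largest singular values and rebuild the matrix.
k <- 2
s <- svd(filled)
approx_k <- s$u[, 1:k, drop = FALSE] %*%
  diag(s$d[1:k], nrow = k) %*%
  t(s$v[, 1:k, drop = FALSE])

# Step 3: read the estimates off the cells that were originally missing.
estimates <- approx_k[missing]
print(round(estimates, 2))
```

Every cell of the low-rank reconstruction is defined, which is why this approach can score items for every user.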
# function to match recommended items with names of anime items
# (note: @items holds column positions in the rating matrix, not anime_id values)
svd_recc <- function(i){
p <- anime_svd_pred@items[[i]]
p <- data.frame("id" = as.factor(p))
p <- inner_join(p, anime, by = c("id" = "anime_id")) %>% select(name, type)
return(as.data.frame(p))
}
## [[1]]
## name type
## 1 Ghost in the Shell: Stand Alone Complex TV
## 2 eX-Driver the Movie Movie
## 3 Mizuiro OVA
## 4 Shuffle! TV
## 5 Girls Bravo: Second Season TV
## 6 Buttobi!! CPU OVA
## 7 Maze☆Bakunetsu Jikuu (TV) TV
## 8 Pokemon: Senritsu no Mirage Pokemon Special
##
## [[2]]
## name type
## 1 Ai Shimai 2: Futari no Kajitsu OVA
## 2 Otome wa Boku ni Koishiteru TV
## 3 Babel Nisei (1992) OVA
## 4 Soreyuke! Uchuu Senkan Yamamoto Yohko (1999) TV
## 5 Shintaisou: Shin OVA
## 6 Romeo x Juliet TV
## 7 Cosmo Warrior Zero TV
## 8 Bartender TV
## 9 Green Green Specials OVA
## 10 Galaxy Angel Rune TV
##
## [[3]]
## name type
## 1 Geobreeders: File-X Chibi Neko Dakkan OVA
## 2 Detective Conan Movie 09: Strategy Above the Depths Movie
## 3 Fushigiboshi no☆Futagohime TV
## 4 Boukyaku no Senritsu TV
## 5 New Dominion Tank Police OVA
## 6 Lupin III: Nusumareta Lupin Special
## 7 Green Green TV
## 8 Buttobi!! CPU OVA
## 9 Blood Royale OVA
##
## [[4]]
## name type
## 1 Kage kara Mamoru! TV
## 2 Yu☆Gi☆Oh!: Duel Monsters GX TV
## 3 Gokinjo Monogatari TV
## 4 Tokyo Godfathers Movie
## 5 Grappler Baki: Saidai Tournament-hen TV
## 6 Bishoujo Senshi Sailor Moon: Sailor Stars TV
## 7 Weiß Kreuz OVA OVA
## 8 Zetsuai 1989 OVA
## 9 Blame! ONA
Hybrid Recommender
In order to incorporate serendipity, novelty, or diversity, we created a hybrid model with the following weights:
- 50% for IBCF
- 30% for POPULAR
- 10% for RERECOMMEND
- 10% for RANDOM
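The effect of these weights can be illustrated with a small base-R sketch that blends per-model scores into one ranking. The score values and item names are made up, and a plain weighted average is our assumption about how the scores are combined, not a statement about the library's internals.

```r
# Weights from the list above.
weights <- c(IBCF = 0.5, POPULAR = 0.3, RERECOMMEND = 0.1, RANDOM = 0.1)

# Hypothetical scores (one row per model, one column per item) for a single user.
scores <- matrix(
  c(0.9, 0.1, 0.4,    # IBCF
    0.2, 0.8, 0.6,    # POPULAR
    0.0, 0.0, 1.0,    # RERECOMMEND
    0.5, 0.3, 0.7),   # RANDOM
  nrow = 4, byrow = TRUE,
  dimnames = list(names(weights), c("a1", "a2", "a3"))
)

# Multiply each model's row by its weight, then sum per item.
hybrid_scores <- colSums(scores * weights) / sum(weights)

# Items ordered from most to least recommended.
ranking <- names(sort(hybrid_scores, decreasing = TRUE))
print(hybrid_scores)
print(ranking)
```

IBCF dominates the blend, but the smaller components can still reorder items that IBCF scores similarly.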
Training
tic("Hybrid Recommender Training")
model_hybrid <- HybridRecommender(
Recommender(data = getData(anime_evaluation, "train"), method = "IBCF"),
Recommender(data = getData(anime_evaluation, "train"), method = "POPULAR"),
Recommender(data = getData(anime_evaluation, "train"), method = "RERECOMMEND"),
Recommender(data = getData(anime_evaluation, "train"), method = "RANDOM"),
weights = c(0.5, 0.3, 0.1, 0.1) # diversity
)
toc(log = T, quiet = T)
## Warning in .local(x, ...): x was already normalized by row!
## Warning in .local(x, ...): x was already normalized by row!
Predict
tic("Hybrid Recommender Predicting: top 10")
anime_hybrid_pred <- predict(object = model_hybrid, newdata = getData(anime_evaluation, "known"), n = 10)
toc(log = T, quiet = T)
tic("Hybrid Recommender Predicting: ratings")
anime_hybrid_predr <- predict(object = model_hybrid, newdata = getData(anime_evaluation, "known"), type = "ratings")
toc(log = T, quiet = T)
## $`226`
## [1] 1038 45 671 996 728 590 226 8 1633 139
##
## $`392`
## [1] 1038 1024 996 705 761 1633 1041 1125 529 70
##
## $`478`
## [1] 482 1038 858 283 226 154 1633 529 671 1287
##
## $`804`
## [1] 333 1708 482 378 558 698 1125 1038 1492 600
##
## $`2632`
## [1] 761 1633 1038 698 8 1708 245 1185 1717 378
##
## $`3009`
## [1] 616 1229 810 18 724 271 691 477 1347 154
##
## $`3117`
## [1] 1038 996 332 1185 1671 590 15 467 558 1161
##
## $`3338`
## [1] 1287 1186 1125 1185 1655 627 567 425 285 154
Some of the items recommended by IBCF and SVD also appear in the hybrid recommender's lists.
Let’s see the actual items recommended.
# function to match recommended items with names of anime items
# (note: @items holds column positions in the rating matrix, not anime_id values)
hybrid_recc <- function(i){
p <- anime_hybrid_pred@items[[i]]
p <- data.frame("id" = as.factor(p))
p <- inner_join(p, anime, by = c("id" = "anime_id")) %>% select(name, type)
return(as.data.frame(p))
}
## [[1]]
## name type
## 1 Naruto: Akaki Yotsuba no Clover wo Sagase Special
## 2 Bishoujo Senshi Sailor Moon: Sailor Stars TV
## 3 eX-Driver the Movie Movie
## 4 Yu☆Gi☆Oh!: Duel Monsters GX TV
## 5 Haru no Ashioto The Movie: Ourin Dakkan Movie
## 6 Shintaisou: Kari OVA
## 7 Soreyuke! Uchuu Senkan Yamamoto Yohko II OVA
## 8 Ojamajo Doremi Dokkaan! TV
##
## [[2]]
## name type
## 1 Yu☆Gi☆Oh!: Duel Monsters GX TV
## 2 Mizuiro OVA
## 3 Gunparade Orchestra TV
## 4 Akage no Anne TV
## 5 Elfen Lied TV
## 6 Shaman King TV
## 7 Shintaisou: Kari OVA
## 8 Saishuu Heiki Kanojo TV
## 9 Lemon Angel Project TV
## 10 Odin: Koushi Hansen Starlight Movie
##
## [[3]]
## name type
## 1 Mizuiro OVA
## 2 Weiß Kreuz OVA OVA
## 3 Bishoujo Senshi Sailor Moon: Sailor Stars TV
## 4 Chou Henshin Cosprayers vs. Ankoku Uchuu Shougun the Movie Movie
## 5 Naruto: Akaki Yotsuba no Clover wo Sagase Special
## 6 Shintaisou: Kari OVA
## 7 Zetsuai 1989 OVA
## 8 Saishuu Heiki Kanojo TV
##
## [[4]]
## name type
## 1 Shintaisou: Kari OVA
## 2 Mizuiro OVA
## 3 Battle Athletess Daiundoukai OVA
## 4 Mobile Police Patlabor: WXIII Movie
## 5 Bishoujo Senshi Sailor Moon: Sailor Stars TV
## 6 Naruto: Akaki Yotsuba no Clover wo Sagase Special
## 7 Android Ana Maico 2010 TV
## 8 Petshop of Horrors TV
Calculating and comparing accuracies
IBCF
anime_item_acc1 <- calcPredictionAccuracy(x = anime_pred, data = getData(anime_evaluation, "unknown"), given = 4, goodRating = 5)
anime_item_acc2 <- calcPredictionAccuracy(x = anime_predr, data = getData(anime_evaluation, "unknown"))
SVD
anime_svd_acc1 <- calcPredictionAccuracy(x = anime_svd_pred, data = getData(anime_evaluation, "unknown"), given = 4, goodRating = 5)
anime_svd_acc2 <- calcPredictionAccuracy(x = anime_svd_predr, data = getData(anime_evaluation, "unknown"))
Hybrid
anime_hy_acc1 <- calcPredictionAccuracy(x = anime_hybrid_pred, data = getData(anime_evaluation, "unknown"), given = 4, goodRating = 5)
anime_hy_acc2 <- calcPredictionAccuracy(x = anime_hybrid_predr, data = getData(anime_evaluation, "unknown"))
TopN
Model | TP | FP | FN | TN | precision | recall | TPR | FPR |
---|---|---|---|---|---|---|---|---|
anime_item_acc1 | 0.2710027 | 8.065041 | 109.1463 | 1598.518 | 0.0382322 | 0.0055655 | 0.0055655 | 0.0048232 |
anime_svd_acc1 | 1.3929539 | 8.607046 | 108.0244 | 1597.976 | 0.1392954 | 0.0066003 | 0.0066003 | 0.0052008 |
anime_hy_acc1 | 0.9186992 | 9.081301 | 108.4986 | 1597.501 | 0.0918699 | 0.0082115 | 0.0082115 | 0.0055893 |
Ratings
Model | RMSE | MSE | MAE |
---|---|---|---|
anime_item_acc2 | 1.546860 | 2.392774 | 1.094808 |
anime_svd_acc2 | 1.673719 | 2.801336 | 1.224757 |
anime_hy_acc2 | 1.595474 | 2.545536 | 1.160603 |
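For readers who want to see how the measures in these tables are defined, here is a minimal base-R sketch. The `rating_errors` helper and the toy prediction vectors are illustrative. The last lines recompute the SVD precision from the averaged TP and FP counts reported in the TopN table.

```r
# RMSE, MSE and MAE between predicted and actual ratings.
rating_errors <- function(predicted, actual) {
  err <- predicted - actual
  c(RMSE = sqrt(mean(err^2)), MSE = mean(err^2), MAE = mean(abs(err)))
}

# Toy example with made-up predictions and actual ratings.
toy <- rating_errors(predicted = c(7.5, 6.0, 9.0, 4.0),
                     actual    = c(8.0, 6.0, 7.0, 5.0))
print(toy)

# Precision = TP / (TP + FP), using the SVD row of the TopN table.
svd_tp <- 1.3929539
svd_fp <- 8.607046
svd_precision <- svd_tp / (svd_tp + svd_fp)
print(svd_precision)
```

RMSE is the square root of MSE, so it penalizes large misses more heavily than MAE does, which is why the two can rank models differently.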
ROC Curve
models_evaluate <- list(
IBCF = list(name = "IBCF", param = list(method = "cosine")),
SVD = list(name = "SVD", param = list(k = 30)),
POPULAR = list(name = "POPULAR", param = NULL),
RANDOM = list(name = "RANDOM", param = NULL)
)
results <- evaluate(anime_evaluation, method = models_evaluate, n = c(1, 3, 5, 15, 20))
## IBCF run fold/sample [model time/prediction time]
## 1
## Warning in .local(x, ...): x was already normalized by row!
## [25.6sec/0.08sec]
## 2
## Warning in .local(x, ...): x was already normalized by row!
## [24.32sec/0.08sec]
## 3
## Warning in .local(x, ...): x was already normalized by row!
## [24.78sec/0.06sec]
## 4
## Warning in .local(x, ...): x was already normalized by row!
## [24.08sec/0.08sec]
## SVD run fold/sample [model time/prediction time]
## 1
## Warning in .local(x, ...): x was already normalized by row!
## [0.66sec/0.27sec]
## 2
## Warning in .local(x, ...): x was already normalized by row!
## [0.63sec/0.28sec]
## 3
## Warning in .local(x, ...): x was already normalized by row!
## [0.85sec/0.28sec]
## 4
## Warning in .local(x, ...): x was already normalized by row!
## [0.64sec/0.28sec]
## POPULAR run fold/sample [model time/prediction time]
## 1
## Warning in .local(x, ...): x was already normalized by row!
## [0.02sec/1.3sec]
## 2
## Warning in .local(x, ...): x was already normalized by row!
## [0.01sec/1.31sec]
## 3
## Warning in .local(x, ...): x was already normalized by row!
## [0.01sec/1.31sec]
## 4
## Warning in .local(x, ...): x was already normalized by row!
## [0.02sec/1.31sec]
## RANDOM run fold/sample [model time/prediction time]
## 1 [0sec/0.32sec]
## 2 [0sec/0.32sec]
## 3 [0sec/0.33sec]
## 4 [0.02sec/0.35sec]
Runtime comparison
# Run time comparison
log <- as.data.frame(unlist(tic.log(format = TRUE)))
colnames(log) <- c("Run Time")
knitr::kable(log, format = "html") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
Run Time |
---|
IBCF Model Training: 25.76 sec elapsed |
IBCF Model Predicting: top 10: 0.11 sec elapsed |
IBCF Model Predicting: ratings: 0.06 sec elapsed |
SVD Model Training: 0.39 sec elapsed |
SVD Model predicting: top 10: 0.32 sec elapsed |
SVD Model predicting: ratings: 0.21 sec elapsed |
Hybrid Recommender Training: 26.37 sec elapsed |
Hybrid Recommender Predicting: top 10: 6.55 sec elapsed |
Hybrid Recommender Predicting: ratings: 6.38 sec elapsed |
Summary
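A good recommender combines low error with low runtime. In our accuracy and runtime tables, no single model wins on every measure: IBCF has the lowest rating error (RMSE 1.55) but takes about 26 seconds to train and returns no recommendations for some users; SVD trains fastest (0.39 seconds) and has the highest top-N precision (0.139) but the highest RMSE (1.67); the Hybrid falls between the two on RMSE (1.60), has the highest recall (0.0082), and is the slowest to train and predict. We consider the Hybrid the best overall compromise, since its component models make up for one another's shortcomings. We also noted that recommending items to new users is difficult because of the lack of historical data.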
Take home from this course
In our day-to-day life, we used recommenders such as those on Amazon.com and Netflix.com, but didn’t know about the underlying algorithms. This course offered an opportunity to learn them through a variety of exercises.