library(tidyverse)
library(recommenderlab)
This is part of an anime data set scraped from myanimelist. I used the first million rows of user data and the entire item data set. This is not the full data set.
https://www.kaggle.com/datasets/CooperUnion/anime-recommendations-database?resource=download
#read CSVs
user_ratings <- read.csv('https://raw.githubusercontent.com/samanthabarbaro/data612_recommender_systems/refs/heads/main/animeratings.csv', na.strings = c("", "NA", "null", "NULL"), header = TRUE)
item_info <- read.csv('https://raw.githubusercontent.com/samanthabarbaro/data612_recommender_systems/refs/heads/main/anime.csv', na.strings = c("", "NA", "null", "NULL"), header = TRUE)
#number of user ratings
count(user_ratings)
## n
## 1 1048575
#number of shows/movies
count(item_info)
## n
## 1 12294
#number of ratings per user:
#user_ratings |> count(user_id)
#number of users
n_distinct(user_ratings$user_id)
## [1] 10093
We have over 1 million anime ratings by about 10k users.
If a user marked an anime as watched but didn’t rate it, it’s scored as -1. I am going to filter those out because we are looking at algorithms that predict scores, and this would throw off the baseline significantly. However, it would be good for an implicit ratings set. Anime is a good candidate for implicit ratings - there are movies here, but fans probably wouldn’t mark an entire series as “watched” unless they liked it at least a little.
user_ratings <- user_ratings |>
filter(rating > -1)
#854k ratings left
count(user_ratings)
## n
## 1 854113
Join the user ratings matrix with the anime name based on anime ID:
#remove weird text from item info first
item_info$name <- gsub(""", "", item_info$name)
item_info$name <- gsub("amp;", "", item_info$name)
#join
user_item <- left_join(user_ratings, item_info |>
select(anime_id, name), by = "anime_id")
#remove the anime ID
user_item <- user_item |>
select(user_id, rating, name)
More exploratory analysis to see how many ratings users have.
anime_table <- pivot_wider(
data = user_item,
names_from = name,
values_from = rating
)
rating_counts <- rowSums(!is.na(anime_table |> select(-user_id)))
table(rating_counts)
## rating_counts
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 473 300 253 195 186 164 159 150 118 126 136 124 91 102 119 102
## 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## 109 93 85 105 84 74 89 90 72 78 83 82 66 65 71 70
## 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
## 54 68 81 46 57 62 57 54 55 51 60 46 52 71 56 57
## 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
## 46 50 36 60 40 49 48 39 49 37 46 42 56 42 48 38
## 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 37 41 49 41 40 27 37 30 25 35 35 43 39 28 30 23
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
## 24 31 27 25 28 34 23 31 28 31 30 32 33 32 25 37
## 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112
## 29 35 24 16 29 22 27 22 23 24 23 28 22 22 36 22
## 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128
## 23 25 21 17 18 18 18 14 21 19 12 19 17 22 29 22
## 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## 18 20 26 19 24 14 14 20 14 13 20 15 9 11 20 9
## 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
## 14 12 15 18 15 16 12 13 10 11 15 19 14 8 7 22
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176
## 18 9 12 11 13 9 12 11 11 8 7 18 8 11 14 10
## 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192
## 5 10 15 13 5 12 12 9 10 7 12 10 10 9 10 7
## 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208
## 9 7 5 6 9 10 8 8 10 5 5 5 8 7 8 9
## 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224
## 8 6 6 5 11 1 9 6 11 2 7 4 6 6 9 8
## 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240
## 5 5 10 5 5 7 8 6 6 9 5 11 9 7 7 11
## 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256
## 8 6 1 5 4 5 3 12 10 9 4 5 6 4 9 7
## 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272
## 3 4 4 6 7 5 2 8 10 8 3 2 2 11 6 2
## 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288
## 4 5 7 4 5 4 2 4 6 3 6 5 3 5 3 3
## 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304
## 2 3 2 2 2 4 8 5 3 5 5 1 2 6 5 6
## 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320
## 1 3 3 4 3 4 3 1 4 4 2 6 3 3 2 1
## 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336
## 4 3 10 1 6 5 2 8 2 3 4 4 1 5 3 1
## 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352
## 1 2 2 2 2 3 5 3 4 6 4 5 2 2 1 1
## 353 354 355 356 357 358 359 360 361 362 363 364 365 367 368 369
## 5 6 3 1 4 2 2 1 3 2 2 2 2 3 8 2
## 370 371 372 373 374 375 376 377 378 379 381 382 383 384 385 386
## 3 2 1 5 1 1 3 2 1 3 2 2 1 2 6 1
## 387 389 391 392 393 394 395 396 397 398 400 401 402 403 404 405
## 1 2 2 5 2 1 1 1 1 4 1 1 1 2 2 1
## 406 407 409 410 412 413 414 415 416 417 418 419 420 422 423 425
## 4 4 1 5 2 1 1 2 2 3 1 1 2 2 1 1
## 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442
## 1 3 1 1 3 1 4 1 3 3 1 2 2 1 1 1
## 443 444 445 446 447 448 449 450 451 453 455 458 459 460 461 462
## 3 1 2 1 1 2 1 4 1 1 1 1 3 1 1 2
## 463 465 468 469 471 472 473 474 476 477 482 484 485 486 487 488
## 1 2 1 1 1 1 2 2 1 2 1 1 1 2 1 1
## 489 494 495 497 498 499 500 502 504 506 507 511 513 514 516 519
## 1 1 2 1 1 2 1 1 2 1 2 1 1 3 3 2
## 521 522 524 525 527 528 529 534 540 541 545 546 547 550 551 555
## 1 1 1 2 1 1 2 1 1 1 1 1 1 2 1 1
## 560 562 563 567 568 569 570 571 572 574 575 578 581 582 584 589
## 1 2 1 2 1 1 2 1 1 1 2 1 1 1 1 1
## 593 598 599 600 602 607 608 609 610 611 615 616 617 621 622 623
## 1 1 2 1 1 1 3 2 2 1 1 1 1 1 1 1
## 624 630 631 636 638 641 650 651 656 657 660 662 663 665 666 670
## 1 2 1 1 1 1 1 1 2 1 1 1 1 1 2 1
## 671 672 675 682 685 686 693 695 698 701 707 715 719 735 742 743
## 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1
## 750 751 753 774 775 781 783 790 795 801 803 807 819 820 821 830
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 834 835 837 843 849 854 857 863 877 878 880 903 904 912 940 943
## 1 1 1 2 1 1 2 2 1 1 1 1 1 1 1 1
## 951 952 976 989 990 993 998 1001 1011 1012 1013 1019 1032 1040 1041 1042
## 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1
## 1052 1088 1132 1154 1156 1180 1188 1227 1338 1403 1412 1584 1702 2429
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1
rating_counts_2 <- as_tibble(rating_counts)
#histogram of ratings
rating_counts_2 |> ggplot(aes(x = value)) +
geom_histogram(bins = 60) +
labs(title = "Number of Ratings by User", x = "number of ratings") +
theme_minimal()
#one user has 2429 ratings
max(rating_counts)
## [1] 2429
#some users only have 1
min(rating_counts)
## [1] 1
#avg user rated 89 shows/movies
mean(rating_counts)
## [1] 89.38912
Pivot wider and make a user-item matrix:
anime_matrix <- pivot_wider(
data = user_item,
names_from = name,
values_from = rating
) |>
#make user_id the row name
column_to_rownames("user_id") |>
as.matrix()
#check dimensions
dim(anime_matrix)
## [1] 9555 7942
#create real ratings matrix
real_anime_matrix <- as(anime_matrix, "realRatingMatrix")
I will exclude users with fewer than 15 ratings, which seems reasonable given the range of ratings in this data set:
#min 15 ratings
real_anime_matrix20 <- real_anime_matrix[rowCounts(
real_anime_matrix) > 15, ]
Here, I turn the genre information about each anime into an item matrix. I do this by breaking the genre column up, getting rid of weird characters that cause matching issues, and pivoting wider. The item matrix will allow me to create a content-based collaborative filtering system.
anime_rows <- separate_longer_delim(item_info, cols = genre, delim = ",")
anime_rows <- anime_rows |> select(name, genre, type)
#clean it up
anime_rows$name <- gsub("amp;", "", anime_rows$name)
item_features <- anime_rows |>
distinct(name, genre) |>
mutate(genre_val = 1) |>
pivot_wider(names_from = genre, values_from = genre_val,
values_fill = 0, values_fn = max) |>
column_to_rownames("name")
Test to find common items - I chose to identify anime by title, not ID, which can get messy when matching two matrices. With this many entries, weird characters are likely to import. Here, I make sure everything that’s in the user matrix is in the item matrix.
common_items <- intersect(rownames(item_features),
colnames(real_anime_matrix20))
item_features_aligned <- item_features[common_items, ]
rating_matrix_aligned <- real_anime_matrix20[, common_items]
#check to make sure no items were left out because of weird punctuation
missing <- setdiff(colnames(real_anime_matrix20), rownames(item_features))
missing2 <- setdiff(rownames(item_features), colnames(real_anime_matrix20))
length(missing)
## [1] 0
length(missing2)
## [1] 4350
There are no items that have been rated by users that aren’t in the item matrix, but there are items that have not been rated by users. This makes sense - I am only using a portion of the original data set.
Calculate cosine similarity between items:
library(proxy)
# compute cosine similarity
item_sim <- as.matrix(simil(as.matrix(item_features), method = "cosine"))
item_sim_matrix <- as.matrix(item_sim)
Predict scores with 15 visible ratings:
cbf_predict <- function(user_ratings, item_sim, n = 30) {
rated_idx <- which(!is.na(user_ratings))
unrated_idx <- which(is.na(user_ratings))
scores <- sapply(unrated_idx, function(i) {
sims <- item_sim[i, rated_idx]
rats <- user_ratings[rated_idx]
sum(sims * rats) / (sum(abs(sims)) + 1e-9)
})
names(scores) <- names(user_ratings)[unrated_idx]
sort(scores, decreasing = TRUE)[1:n]
}
Evaluate (I had to adjust the number of times this runs and take the user list down to 200 because the user list was too large):
#loop over users
set.seed(1122)
rating_m <- as(rating_matrix_aligned, "matrix")
n_users <- nrow(rating_m)
test_users <- sample(1:n_users, 200)
precision_list <- c()
recall_list <- c()
rmse_list <- c()
for (i in test_users) {
user_ratings <- rating_m[i, ]
rated_idx <- which(!is.na(user_ratings))
if (length(rated_idx) < 5) next
test_idx <- sample(rated_idx, max(1, floor(0.2 * length(rated_idx))))
train_ratings <- user_ratings
train_ratings[test_idx] <- NA
preds <- cbf_predict(train_ratings, item_sim, n = 30)
pred_items <- names(preds)
relevant <- names(user_ratings[test_idx][user_ratings[test_idx] >= 6])
tp <- length(intersect(pred_items, relevant))
precision_list[i] <- tp / length(pred_items)
recall_list[i] <- ifelse(length(relevant) > 0, tp / length(relevant), NA)
common <- intersect(pred_items, names(user_ratings[test_idx]))
if (length(common) > 0) {
actual <- user_ratings[test_idx][common]
predicted <- preds[common]
rmse_list[i] <- sqrt(mean((actual - predicted)^2))
}
}
#calculate metrics
cat("Precision@30:", mean(precision_list, na.rm = TRUE), "\n")
## Precision@30: 0.002333333
cat("Recall@30: ", mean(recall_list, na.rm = TRUE), "\n")
## Recall@30: 0.003347666
cat("RMSE: ", mean(rmse_list, na.rm = TRUE), "\n")
## RMSE: 1.550275
The precision and recall aren’t great, possibly due to the small sample size of just 200 users. RMSE is under 2, and similar to UBCF (this is a 1-10 scale instead of a 1-5).
UBCF using default settings.
set.seed(1122)
scheme <- evaluationScheme(real_anime_matrix20,
method = "split",
train = 0.8,
#giving the model access to 15 ratings
#most users will have a few more than this
given = 15,
#ratings are on a 1-10 scale
goodRating = 6)
#create recommender
rec1 <- Recommender(getData(scheme, "train"), method = "UBCF")
#predictions - let's see the recommendations for these 6 users
pre <- predict(rec1, real_anime_matrix20[150:155], n = 5)
pre
## Recommendations as 'topNList' with n = 5 for 6 users.
as(pre, "list")
## $`0`
## [1] "Pandora Hearts" "Kuroshitsuji Special" "Bobobo-bo Bo-bobo"
## [4] "Nana" "Nana Recaps"
##
## $`1`
## [1] "Shoujo Kakumei Utena" "Kino no Tabi: The Beautiful World"
## [3] "Texhnolyze" "Touch"
## [5] "Gintama"
##
## $`2`
## [1] "Kuroko no Basket 2nd Season" "Kuroko no Basket 3rd Season"
## [3] "Tasogare Otome x Amnesia" "Golden Time"
## [5] "Ookami to Koushinryou"
##
## $`3`
## [1] "Haikyuu!!: Karasuno Koukou VS Shiratorizawa Gakuen Koukou"
## [2] "Detective Conan Movie 13: The Raven Chaser"
## [3] "Pokemon: Mewtwo no Gyakushuu"
## [4] "Mushishi"
## [5] "Mushishi Special: Hihamukage"
##
## $`4`
## [1] "Dragon Ball Z" "Street Fighter II: The Movie"
## [3] "Dragon Ball GT" "Strait Jacket"
## [5] "Mimi wo Sumaseba"
##
## $`5`
## [1] "Saiunkoku Monogatari"
## [2] "Angel Beats!: Another Epilogue"
## [3] "Kino no Tabi: The Beautiful World"
## [4] "Kino no Tabi: Nanika wo Suru Tame ni - Life Goes On."
## [5] "Planetes"
#check the accuracy
results <- evaluate(scheme, method = "UBCF", type = "ratings")
## UBCF run fold/sample [model time/prediction time]
## 1 [0.03sec/318.94sec]
avg(results)
## RMSE MSE MAE
## [1,] 1.579522 2.494889 1.212381
results_2 <- evaluate(scheme,
method = "UBCF",
type = "topNList",
n = c(5, 10, 20, 30, 40))
## UBCF run fold/sample [model time/prediction time]
## 1 [0.02sec/307.37sec]
avg(results_2)
## TP FP FN TN N precision recall
## [1,] 0.1967930 4.744898 95.93294 7826.465 7927.34 0.03982301 0.001935160
## [2,] 0.4125364 9.470845 95.71720 7821.739 7927.34 0.04174041 0.003984785
## [3,] 0.8090379 18.957726 95.32070 7812.252 7927.34 0.04092920 0.008059896
## [4,] 1.2871720 28.362974 94.84257 7802.847 7927.34 0.04341200 0.012854015
## [5,] 1.7689504 37.764577 94.36079 7793.445 7927.34 0.04474558 0.017576514
## TPR FPR n
## [1,] 0.001935160 0.0006056703 5
## [2,] 0.003984785 0.0012089195 10
## [3,] 0.008059896 0.0024199557 20
## [4,] 0.012854015 0.0036203765 30
## [5,] 0.017576514 0.0048202719 40
UBCF takes a long time to run due to the size of the data set. The RMSE and MSE are 1.58 and 2.5; RMSE is slightly higher than CBF. Precision and recall both outperform CBF.
In practice, it is a good idea to optimize the neighborhood size, but in this case, I will keep the default settings because the algorithm already takes a long time to run and the purpose of this project is to see how adding other elements, like diversity, affect accuracy.
set.seed(1122)
results_3 <- evaluate(scheme, method = "SVD", type = "ratings")
## SVD run fold/sample [model time/prediction time]
## 1 [4.33sec/1.87sec]
avg(results_3)
## RMSE MSE MAE
## [1,] 1.303997 1.700408 0.9808316
results_4 <- evaluate(scheme,
method = "SVD",
type = "topNList",
n = c(5, 10, 20, 30, 40))
## SVD run fold/sample [model time/prediction time]
## 1 [4.11sec/2.81sec]
#check the results
avg(results_4)
## TP FP FN TN N precision recall
## [1,] 0.00000000 4.941691 96.12974 7826.268 7927.34 0.0000000000 0.0000000000
## [2,] 0.00000000 9.883382 96.12974 7821.327 7927.34 0.0000000000 0.0000000000
## [3,] 0.00000000 19.766764 96.12974 7811.443 7927.34 0.0000000000 0.0000000000
## [4,] 0.01239067 29.637755 96.11735 7801.572 7927.34 0.0004178958 0.0001653730
## [5,] 0.04373178 39.489796 96.08601 7791.720 7927.34 0.0011061947 0.0003892627
## TPR FPR n
## [1,] 0.0000000000 0.0006312848 5
## [2,] 0.0000000000 0.0012625695 10
## [3,] 0.0000000000 0.0025251390 20
## [4,] 0.0001653730 0.0037861008 30
## [5,] 0.0003892627 0.0050445864 40
RMSE, etc. are actually better than UBCF, but precision and recall are much worse. Let’s find a better k, if possible:
k_to_test <- c(5, 10, 20, 30, 40, 50)
set.seed(1122)
results_by_k <- lapply(k_to_test, function(k) {
evaluate(scheme,
method = "SVD",
type = "topNList",
n = c(5, 10, 20, 30, 40),
parameter = list(k = k))
})
## SVD run fold/sample [model time/prediction time]
## 1 [2.88sec/2.69sec]
## SVD run fold/sample [model time/prediction time]
## 1 [4.16sec/2.8sec]
## SVD run fold/sample [model time/prediction time]
## 1 [6.16sec/2.91sec]
## SVD run fold/sample [model time/prediction time]
## 1 [8.61sec/2.88sec]
## SVD run fold/sample [model time/prediction time]
## 1 [10.25sec/3.17sec]
## SVD run fold/sample [model time/prediction time]
## 1 [13.03sec/3.07sec]
# check results
avg(results_by_k[[1]])
## TP FP FN TN N precision recall
## [1,] 0.00000000 4.941691 96.12974 7826.268 7927.34 0.0000000000 0.0000000000
## [2,] 0.00000000 9.883382 96.12974 7821.327 7927.34 0.0000000000 0.0000000000
## [3,] 0.00000000 19.766764 96.12974 7811.443 7927.34 0.0000000000 0.0000000000
## [4,] 0.01239067 29.637755 96.11735 7801.572 7927.34 0.0004178958 0.0001653730
## [5,] 0.04154519 39.491983 96.08819 7791.718 7927.34 0.0010508850 0.0003679217
## TPR FPR n
## [1,] 0.0000000000 0.0006312848 5
## [2,] 0.0000000000 0.0012625695 10
## [3,] 0.0000000000 0.0025251390 20
## [4,] 0.0001653730 0.0037861008 30
## [5,] 0.0003679217 0.0050448626 40
# Precision data frame
prec_df_svd <- do.call(rbind, lapply(seq_along(k_to_test), function(i) {
a <- avg(results_by_k[[i]])
data.frame(k = k_to_test[i],
n = a[, "n"],
precision = a[, "precision"])
}))
ggplot(prec_df_svd, aes(x = n, y = precision, color = factor(k), group = factor(k))) +
geom_line() + geom_point() +
labs(title = "SVD Precision by Dimensionality (k) and Number of Suggestions",
x = "Number of Suggestions", y = "Precision",
color = "Dimensionality (k)") +
theme_minimal()
# TPR data frame
tpr_df_svd <- do.call(rbind, lapply(seq_along(k_to_test), function(i) {
a <- avg(results_by_k[[i]])
data.frame(k = k_to_test[i],
n = a[, "n"],
TPR = a[, "TPR"])
}))
ggplot(tpr_df_svd, aes(x = n, y = TPR, color = factor(k), group = factor(k))) +
geom_line() + geom_point() +
labs(title = "SVD TPR (recall) by Dimensionality (k) and Number of Suggestions",
x = "Number of Suggestions", y = "TPR (recall)",
color = "Dimensionality (k)") +
theme_minimal()
Increasing dimensionality does not result in dramatic improvement. Some of these numbers are very similar, which is why they’re not showing up in the graph (overlap). For TPR, 40 and 50 have the same trajectory, 5 and 10 are similar. 50 dimensions does the best overall, but the next-best is 30, which is close, and takes less long to run.
Tested to see if this was a cold start problem - I tried this with a larger given (35) separately, and it didn’t result in much improvement.
set.seed(1122)
results_5 <- evaluate(scheme, method = "SVD", type = "ratings",
parameter = list(k = 30))
## SVD run fold/sample [model time/prediction time]
## 1 [7.39sec/2.09sec]
avg(results_5)
## RMSE MSE MAE
## [1,] 1.302501 1.696509 0.9794609
results_6 <- evaluate(scheme,
method = "SVD",
type = "topNList",
n = c(5, 10, 20, 30, 40),
parameter = list(k = 30))
## SVD run fold/sample [model time/prediction time]
## 1 [8.72sec/3.14sec]
#check the results
avg(results_6)
## TP FP FN TN N precision recall
## [1,] 0.000000000 4.941691 96.12974 7826.268 7927.34 0.000000e+00 0.000000e+00
## [2,] 0.000000000 9.883382 96.12974 7821.327 7927.34 0.000000e+00 0.000000e+00
## [3,] 0.000728863 19.766035 96.12901 7811.444 7927.34 3.687316e-05 3.816037e-06
## [4,] 0.013119534 29.637026 96.11662 7801.573 7927.34 4.424779e-04 1.691890e-04
## [5,] 0.046647230 39.486880 96.08309 7791.723 7927.34 1.179941e-03 4.205039e-04
## TPR FPR n
## [1,] 0.000000e+00 0.0006312848 5
## [2,] 0.000000e+00 0.0012625695 10
## [3,] 3.816037e-06 0.0025250445 20
## [4,] 1.691890e-04 0.0037860062 30
## [5,] 4.205039e-04 0.0050442072 40
results_p <- evaluate(scheme, method = "POPULAR", type = "ratings")
## POPULAR run fold/sample [model time/prediction time]
## 1 [0.05sec/2.15sec]
avg(results_p)
## RMSE MSE MAE
## [1,] 1.283353 1.646995 0.9779674
results_p2 <- evaluate(scheme,
method = "POPULAR",
type = "topNList",
n = c(5, 10, 20, 30, 40))
## POPULAR run fold/sample [model time/prediction time]
## 1 [0.03sec/38.85sec]
avg(results_p2)
## TP FP FN TN N precision recall
## [1,] 1.943878 2.997813 94.18586 7828.212 7927.34 0.3933628 0.03146229
## [2,] 3.414723 6.468659 92.71501 7824.741 7927.34 0.3455015 0.05522792
## [3,] 6.124636 13.642128 90.00510 7817.568 7927.34 0.3098451 0.09301004
## [4,] 7.968659 21.681487 88.16108 7809.528 7927.34 0.2687561 0.11676015
## [5,] 9.844752 29.688776 86.28499 7801.521 7927.34 0.2490229 0.14229725
## TPR FPR n
## [1,] 0.03146229 0.0003815112 5
## [2,] 0.05522792 0.0008237830 10
## [3,] 0.09301004 0.0017372958 20
## [4,] 0.11676015 0.0027618490 30
## [5,] 0.14229725 0.0037824240 40
#what are the recommendations?
rec3 <- Recommender(getData(scheme, "train"), method = "POPULAR")
# recommendations for the same 6 users
pre3 <- predict(rec3, real_anime_matrix20[150:155], n = 5)
as(pre3, "list")
## $`228`
## [1] "Sen to Chihiro no Kamikakushi" "Clannad: After Story"
## [3] "Tengen Toppa Gurren Lagann" "Mononoke Hime"
## [5] "Howl no Ugoku Shiro"
##
## $`230`
## [1] "Fullmetal Alchemist: Brotherhood" "Death Note"
## [3] "Code Geass: Hangyaku no Lelouch R2" "Code Geass: Hangyaku no Lelouch"
## [5] "Steins;Gate"
##
## $`231`
## [1] "Death Note" "Code Geass: Hangyaku no Lelouch R2"
## [3] "Code Geass: Hangyaku no Lelouch" "Steins;Gate"
## [5] "Shingeki no Kyojin"
##
## $`232`
## [1] "Fullmetal Alchemist: Brotherhood" "Code Geass: Hangyaku no Lelouch R2"
## [3] "Steins;Gate" "Shingeki no Kyojin"
## [5] "Sen to Chihiro no Kamikakushi"
##
## $`233`
## [1] "Fullmetal Alchemist: Brotherhood" "Death Note"
## [3] "Code Geass: Hangyaku no Lelouch R2" "Sen to Chihiro no Kamikakushi"
## [5] "Cowboy Bebop"
##
## $`234`
## [1] "Fullmetal Alchemist: Brotherhood" "Code Geass: Hangyaku no Lelouch R2"
## [3] "Code Geass: Hangyaku no Lelouch" "Steins;Gate"
## [5] "Clannad: After Story"
The predicted lists are similar for these users, excluding shows or movies they’ve already seen. The RMSE is lower than UBCF, and precision/recall are higher. The same is true for CBF. SVD’s RMSE comes close, so while the model is bad at picking true positives (choosing items the user has watched) it’s good at making accurate predictions.
rec4 <- Recommender(getData(scheme, "train"), method = "RANDOM")
results_r <- evaluate(scheme, method = "RANDOM", type = "ratings")
## RANDOM run fold/sample [model time/prediction time]
## 1 [0.01sec/1sec]
avg(results_r)
## RMSE MSE MAE
## [1,] 3.803858 14.46934 3.110573
results_r2 <- evaluate(scheme,
method = "RANDOM",
type = "topNList",
n = c(5, 10, 20, 30, 40))
## RANDOM run fold/sample [model time/prediction time]
## 1 [0.01sec/2.42sec]
avg(results_r2)
## TP FP FN TN N precision recall
## [1,] 0.06413994 4.935860 96.06560 7826.274 7927.34 0.01282799 0.0005932556
## [2,] 0.12609329 9.873907 96.00364 7821.336 7927.34 0.01260933 0.0013445853
## [3,] 0.25437318 19.745627 95.87536 7811.464 7927.34 0.01271866 0.0026898599
## [4,] 0.38192420 29.618076 95.74781 7801.592 7927.34 0.01273081 0.0039833604
## [5,] 0.49052478 39.509475 95.63921 7791.700 7927.34 0.01226312 0.0049503555
## TPR FPR n
## [1,] 0.0005932556 0.0006302935 5
## [2,] 0.0013445853 0.0012609300 10
## [3,] 0.0026898599 0.0025215154 20
## [4,] 0.0039833604 0.0037821578 30
## [5,] 0.0049503555 0.0050452918 40
#not displaying the recs for 6 users because they won't tell us anything
RMSE is really high at ~3.8 (higher than everything else), precision and recall are better than SVD, which is concerning for SVD.
Looking at UBCF, SVD, popular, and Random for the first comparison. I will add CBF’s more limited metrics after.
extract_avg <- function(result, method_name) {
as.data.frame(avg(result)) |>
mutate(Method = method_name)
}
combined_df <- bind_rows(
extract_avg(results_2, "UBCF"),
extract_avg(results_6, "SVD (k=30)"),
extract_avg(results_p2, "Popular"),
extract_avg(results_r2, "Random")
)
combined_df
## TP FP FN TN N precision recall
## 1 0.196793003 4.744898 95.93294 7826.465 7927.34 3.982301e-02 1.935160e-03
## 2 0.412536443 9.470845 95.71720 7821.739 7927.34 4.174041e-02 3.984785e-03
## 3 0.809037901 18.957726 95.32070 7812.252 7927.34 4.092920e-02 8.059896e-03
## 4 1.287172012 28.362974 94.84257 7802.847 7927.34 4.341200e-02 1.285402e-02
## 5 1.768950437 37.764577 94.36079 7793.445 7927.34 4.474558e-02 1.757651e-02
## 6 0.000000000 4.941691 96.12974 7826.268 7927.34 0.000000e+00 0.000000e+00
## 7 0.000000000 9.883382 96.12974 7821.327 7927.34 0.000000e+00 0.000000e+00
## 8 0.000728863 19.766035 96.12901 7811.444 7927.34 3.687316e-05 3.816037e-06
## 9 0.013119534 29.637026 96.11662 7801.573 7927.34 4.424779e-04 1.691890e-04
## 10 0.046647230 39.486880 96.08309 7791.723 7927.34 1.179941e-03 4.205039e-04
## 11 1.943877551 2.997813 94.18586 7828.212 7927.34 3.933628e-01 3.146229e-02
## 12 3.414723032 6.468659 92.71501 7824.741 7927.34 3.455015e-01 5.522792e-02
## 13 6.124635569 13.642128 90.00510 7817.568 7927.34 3.098451e-01 9.301004e-02
## 14 7.968658892 21.681487 88.16108 7809.528 7927.34 2.687561e-01 1.167602e-01
## 15 9.844752187 29.688776 86.28499 7801.521 7927.34 2.490229e-01 1.422973e-01
## 16 0.064139942 4.935860 96.06560 7826.274 7927.34 1.282799e-02 5.932556e-04
## 17 0.126093294 9.873907 96.00364 7821.336 7927.34 1.260933e-02 1.344585e-03
## 18 0.254373178 19.745627 95.87536 7811.464 7927.34 1.271866e-02 2.689860e-03
## 19 0.381924198 29.618076 95.74781 7801.592 7927.34 1.273081e-02 3.983360e-03
## 20 0.490524781 39.509475 95.63921 7791.700 7927.34 1.226312e-02 4.950356e-03
## TPR FPR n Method
## 1 1.935160e-03 0.0006056703 5 UBCF
## 2 3.984785e-03 0.0012089195 10 UBCF
## 3 8.059896e-03 0.0024199557 20 UBCF
## 4 1.285402e-02 0.0036203765 30 UBCF
## 5 1.757651e-02 0.0048202719 40 UBCF
## 6 0.000000e+00 0.0006312848 5 SVD (k=30)
## 7 0.000000e+00 0.0012625695 10 SVD (k=30)
## 8 3.816037e-06 0.0025250445 20 SVD (k=30)
## 9 1.691890e-04 0.0037860062 30 SVD (k=30)
## 10 4.205039e-04 0.0050442072 40 SVD (k=30)
## 11 3.146229e-02 0.0003815112 5 Popular
## 12 5.522792e-02 0.0008237830 10 Popular
## 13 9.301004e-02 0.0017372958 20 Popular
## 14 1.167602e-01 0.0027618490 30 Popular
## 15 1.422973e-01 0.0037824240 40 Popular
## 16 5.932556e-04 0.0006302935 5 Random
## 17 1.344585e-03 0.0012609300 10 Random
## 18 2.689860e-03 0.0025215154 20 Random
## 19 3.983360e-03 0.0037821578 30 Random
## 20 4.950356e-03 0.0050452918 40 Random
#Comparison at 30 rec's
combined_df |>
filter(n == 30) |>
select(Method, precision, recall, TPR, FPR) |>
knitr::kable(digits = 4, caption = "Comparison at 30 Recommendations")
| Method | precision | recall | TPR | FPR |
|---|---|---|---|---|
| UBCF | 0.0434 | 0.0129 | 0.0129 | 0.0036 |
| SVD (k=30) | 0.0004 | 0.0002 | 0.0002 | 0.0038 |
| Popular | 0.2688 | 0.1168 | 0.1168 | 0.0028 |
| Random | 0.0127 | 0.0040 | 0.0040 | 0.0038 |
#bar chart
combined_df |>
filter(n == 30) |>
select(Method, precision, recall) |>
pivot_longer(c(precision, recall), names_to = "Metric", values_to = "Value") |>
ggplot(aes(x = Method, y = Value, fill = Metric)) +
geom_col(position = "dodge") +
labs(title = "Precision & Recall at N=30", x = NULL, y = NULL) +
theme_minimal()
SVD is so low it doesn’t even show up on the bar chart. UBCF does significantly better than random, but popular is still a high baseline.
#Precision
ggplot(combined_df, aes(x = n, y = precision, color = Method)) +
geom_line(linewidth = 1) + geom_point() +
labs(title = "Precision by Method",
x = "Number of Recommendations", y = "Precision") +
theme_minimal()
#Reall/true positive rate
ggplot(combined_df, aes(x = n, y = TPR, color = Method)) +
geom_line(linewidth = 1) + geom_point() +
labs(title = "Recall (TPR) by Method",
x = "Number of Recommendations", y = "Recall (TPR)") +
theme_minimal()
#ROC curve
ggplot(combined_df, aes(x = FPR, y = TPR, color = Method)) +
geom_line(linewidth = 1) + geom_point(shape = 3) +
labs(title = "ROC Curve by Method",
x = "FPR", y = "TPR") +
theme_minimal()
Based on the ROC curve, SVD does worse than random, possibly because the data set is so sparse, and there are only a few false positives and true positives overall. With a large rating scale, 1-10, SVD may also struggle to normalize.
Popular does better in every case, but UBCF improves with more recommendations and is better than random.
Adding in collaborative filtering (there’s only one k, 30, for this model):
cbf_row <- data.frame(
Method = "CBF",
n = 30,
precision = mean(precision_list, na.rm = TRUE),
recall = mean(recall_list, na.rm = TRUE)
)
combined_df2 <- bind_rows(
extract_avg(results_2, "UBCF"),
extract_avg(results_6, "SVD (k=30)"),
extract_avg(results_p2, "Popular"),
extract_avg(results_r2, "Random")
) |>
filter(n == 30) |>
select(Method, n, precision, recall) |>
bind_rows(cbf_row)
#bar chart
combined_df2 |>
filter(n == 30) |>
select(Method, precision, recall) |>
mutate(Method = reorder(Method, -precision)) |>
pivot_longer(c(precision, recall), names_to = "Metric", values_to = "Value") |>
ggplot(aes(x = Method, y = Value, fill = Metric)) +
geom_col(position = "dodge") +
labs(title = "Precision & Recall at 30 Suggestions", x = NULL, y = NULL) +
theme_minimal()
CBF’s precision and recall were worse than everything but SVD. However, that’s not true for RMSE:
rmse_df <- data.frame(
Algorithm = c("UBCF", "SVD", "Popular", "Random", "CBF"),
RMSE = c(
avg(results)[,"RMSE"],
avg(results_5)[,"RMSE"],
avg(results_p)[,"RMSE"],
avg(results_r)[,"RMSE"],
mean(rmse_list, na.rm = TRUE)
)
)
rmse_df
## Algorithm RMSE
## 1 UBCF 1.579522
## 2 SVD 1.302501
## 3 Popular 1.283353
## 4 Random 3.803858
## 5 CBF 1.550275
rmse_df |> arrange(desc(RMSE)) |>
ggplot(aes(x = reorder(Algorithm, -RMSE), y = RMSE, fill = Algorithm)) +
geom_col() +
labs(title = "Algorithm by RMSE (at 30 Recommendations)",
x = "Algorithm") +
theme_minimal() +
theme(legend.position = "none")
CF is in the middle for RMSE. Precision and recall might not be great, but the model is able to predict scores better than UBCF. SVD and popular (popular items score well) are close, while UBCF is slightly worse, and random is very high. SVD’s RMSE comes close, so, again, while the model is bad at picking true positives (choosing items the user has watched) it’s good at making accurate predictions on the ones it does choose. It may be better at helping users discover items they haven’t explored yet.
However, because SVD’s precision and recall were so low, I’m only going to modify UBCF and CBF to include other types of recommendations.
I’ll add recommendation diversity in two ways . First, I’m adding a random element to the UBCF recommender (20% random), which should give the results variety but make the metrics slightly worse. I’m using a smaller set of 800 users, since the hybrid recommender took a long time to run (~20 minutes).
# sample 800 users
set.seed(1122)
user_sample <- sample(1:nrow(real_anime_matrix20), 800)
small_matrix <- real_anime_matrix20[user_sample, ]
# train/test split on 800 users
scheme_small <- evaluationScheme(small_matrix,
method = "split",
train = 0.8,
given = 15,
goodRating = 6)
test_data <- getData(scheme_small, "known")
test_unknown <- getData(scheme_small, "unknown")
#retrain models on small scheme
rec1_s <- Recommender(getData(scheme_small,
"train"),
method = "UBCF")
rec4_s <- Recommender(getData(scheme_small,
"train"),
method = "RANDOM")
#adding 20% random suggestions for diversity
diversity <- HybridRecommender(
rec1_s, #UBCF
rec4_s, #random
weights = c(0.8, 0.2)
)
preds_div <- predict(diversity, test_data, type = "ratings")
#check accuracy (RMSE MAE MSE)
acc <- calcPredictionAccuracy(preds_div, test_unknown)
acc
## RMSE MSE MAE
## 2.720935 7.403486 2.077912
preds_topn <- predict(diversity,
test_data,
type = "topNList",
n = 30)
#check true positives, etc.
acc_topn <- calcPredictionAccuracy(preds_topn,
test_unknown,
given = 15,
goodRating = 6)
acc_topn
## TP FP FN TN N precision
## 3.687500e-01 2.963125e+01 1.103375e+02 7.786663e+03 7.927000e+03 1.229167e-02
## recall TPR FPR
## 3.228892e-03 3.228892e-03 3.790708e-03
Adding random recommendations made precision, recall, and RMSE a lot worse, predictably (this may also have to do with the smaller user set, but I was able to run this once on the original data set and the result was similar).
Another way to add diversity is to include popular recommendations: items with broad appeal will be interspersed with more personalized recommendations. This makes sense for anime more than, for example, books, since the medium itself is already an interest.
#retrain popular model on small data set
rec3_s <- Recommender(getData(scheme_small,
"train"),
method = "POPULAR")
#adding 30% popular suggestions for diversity
diversity2 <- HybridRecommender(
rec1_s, #UBCF
rec3_s, #popular
weights = c(0.7, 0.3))
#predict
preds_div2 <- predict(diversity2,
test_data,
type = "ratings")
#check accuracy (RMSE, etc.)
acc2 <- calcPredictionAccuracy(preds_div2,
test_unknown)
acc2
## RMSE MSE MAE
## 1.400226 1.960633 1.060313
preds_topn2 <- predict(diversity2,
test_data, type = "topNList",
n = 30)
acc_topn2 <- calcPredictionAccuracy(preds_topn2,
test_unknown,
given = 15,
goodRating = 6)
acc_topn2
## TP FP FN TN N precision
## 7.500000e-02 2.842500e+01 1.106312e+02 7.787869e+03 7.927000e+03 2.631579e-03
## recall TPR FPR
## 3.707726e-04 3.707726e-04 3.639870e-03
The RMSE came down a bit, and it’s a little closer to popular and SVD. Precision and recall both went down as well.
This uses Maximal Marginal Relevance, which re-ranks predictions based on similarity to items it already predicted. It picks the top n (30) items, but one by one, balancing the list based on similarity to previously selected itms. A lambda of .7 means 70% relevance, 30% avoiding redundant/similar choices.
# use cosine similarity previously calculated (no need to calculate it again)
cbf_predict_mmr <- function(user_ratings, item_sim, n = 30, lambda = 0.7) {
rated_idx <- which(!is.na(user_ratings))
unrated_idx <- which(is.na(user_ratings))
#base relevance scores
scores <- sapply(unrated_idx, function(i) {
sims <- item_sim[i, rated_idx]
rats <- user_ratings[rated_idx]
sum(sims * rats) / (sum(abs(sims)) + 1e-9)
})
#MMR reranking
n_candidates <- length(unrated_idx)
selected_pos <- c()
remaining <- seq_len(n_candidates)
for (k in 1:n) {
if (length(selected_pos) == 0) {
best_pos <- which.max(scores[remaining])
} else {
redundancy <- apply(item_sim[unrated_idx[remaining], unrated_idx[selected_pos], drop = FALSE], 1, max)
mmr <- lambda * scores[remaining] - (1 - lambda) * redundancy
best_pos <- which.max(mmr)
}
selected_pos <- c(selected_pos, remaining[best_pos])
remaining <- remaining[-best_pos]
}
result <- scores[selected_pos]
names(result) <- names(user_ratings)[unrated_idx[selected_pos]]
result
}
set.seed(1122)
test_users <- sample(1:n_users, 200)
precision_list_cd <- c()
recall_list_cd <- c()
rmse_list_cd <- c()
for (i in test_users) {
user_ratings <- rating_m[i, ]
rated_idx <- which(!is.na(user_ratings))
if (length(rated_idx) < 5) next
test_idx <- sample(rated_idx, max(1, floor(0.2 * length(rated_idx))))
train_ratings <- user_ratings
train_ratings[test_idx] <- NA
preds <- cbf_predict_mmr(train_ratings, item_sim, n = 30, lambda = 0.7)
pred_items <- names(preds)
relevant <- names(user_ratings[test_idx][user_ratings[test_idx] >= 6])
tp <- length(intersect(pred_items, relevant))
precision_list_cd[i] <- tp / length(pred_items)
recall_list_cd[i] <- ifelse(length(relevant) > 0, tp / length(relevant), NA)
common <- intersect(pred_items, names(user_ratings[test_idx]))
if (length(common) > 0) {
actual <- user_ratings[test_idx][common]
predicted <- preds[common]
rmse_list_cd[i] <- sqrt(mean((actual - predicted)^2))
}
}
cat("Precision@30:", mean(precision_list_cd, na.rm = TRUE), "\n")
## Precision@30: 0.003333333
cat("Recall@30: ", mean(recall_list_cd, na.rm = TRUE), "\n")
## Recall@30: 0.006107851
cat("RMSE: ", mean(rmse_list_cd, na.rm = TRUE), "\n")
## RMSE: 1.420619
RMSE is actually a little better than the original model, as are precision and recall. This could be because alternating between genres give the model the option make predictions from different pools. So, in a situation where each genre would have a bunch of top options that many people have seen, diversifying predictions may make it easier to predict ratings.
rmse_df_hybrid <- data.frame(
Algorithm = c("UBCF", "SVD", "Popular", "Random",
"CBF", "UBCF-Random",
"UBCF-Popular", "CBF Diversity"),
RMSE = c(
avg(results)[,"RMSE"],
avg(results_5)[,"RMSE"],
avg(results_p)[,"RMSE"],
avg(results_r)[,"RMSE"],
mean(rmse_list, na.rm = TRUE),
acc["RMSE"],
acc2["RMSE"],
mean(rmse_list_cd, na.rm = TRUE)
)
)
rmse_df_hybrid
## Algorithm RMSE
## 1 UBCF 1.579522
## 2 SVD 1.302501
## 3 Popular 1.283353
## 4 Random 3.803858
## 5 CBF 1.550275
## 6 UBCF-Random 2.720935
## 7 UBCF-Popular 1.400226
## 8 CBF Diversity 1.420619
rmse_df_hybrid |> arrange(desc(RMSE)) |>
ggplot(aes(x = reorder(Algorithm, -RMSE), y = RMSE, fill = Algorithm)) +
geom_col() +
labs(title = "Algorithm by RMSE (at 30 Recommendations)",
x = "Algorithm") +
theme_minimal() +
theme(legend.position = "none")
Popular still has the lowest RMSE, but UBCF-Popular comes close, and CBF with diversity performs better than the original CBF algorithm.
cbf_row2 <- data.frame(
Method = "CBF Diversity",
n = 30,
precision = mean(precision_list_cd, na.rm = TRUE),
recall = mean(recall_list_cd, na.rm = TRUE)
)
#add cbf and make data frames that can be added to the table
hybrid1_row <- data.frame(
Method = "UBCF-Random",
n = 30,
precision = acc_topn["precision"],
recall = acc_topn["recall"]
)
hybrid2_row <- data.frame(
Method = "UBCF-Popular",
n = 30,
precision = acc_topn2["precision"],
recall = acc_topn2["recall"]
)
combined_df3 <- combined_df2 |>
bind_rows(hybrid1_row, hybrid2_row) |>
select(Method, n, precision, recall) |>
bind_rows(cbf_row2)
combined_df3
## Method n precision recall
## ...1 UBCF 30 0.0434119961 0.0128540152
## ...2 SVD (k=30) 30 0.0004424779 0.0001691890
## ...3 Popular 30 0.2687561455 0.1167601506
## ...4 Random 30 0.0127308066 0.0039833604
## ...5 CBF 30 0.0023333333 0.0033476662
## precision...6 UBCF-Random 30 0.0122916667 0.0032288916
## precision...7 UBCF-Popular 30 0.0026315789 0.0003707726
## ...8 CBF Diversity 30 0.0033333333 0.0061078514
#bar chart
combined_df3 |>
filter(n == 30) |>
select(Method, precision, recall) |>
mutate(Method = reorder(Method, -precision)) |>
pivot_longer(c(precision, recall), names_to = "Metric", values_to = "Value") |>
ggplot(aes(x = Method, y = Value, fill = Metric)) +
geom_col(position = "dodge") +
labs(title = "Precision & Recall at 30 Suggestions", x = NULL, y = NULL) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45))
For both UBCF hybrid models, precision and recall are significantly worse. For CBF as a hybrid recommender, recall improved, but not precision.
Having CBF predict a more diverse set of items resulted in a lower RMSE. Recall and precision also improved slightly, which suggests that diversifying genre recommendations allows the model to find more of the relevant cases. This test was only run on a pool of 200 users, so, it’s possible that if run on the entire data set, it could perform even better. CBF + diversity also incorporates a more sophisticated model than mixing UBCF with random or popular recommendations.
As predicted, adding random recommendations to the UBCF algorithm made the RMSE, precision, and recall worse. Having a bunch of random suggestions thrown in probably does not improve the user experience for most people. Recommending popular items, while it improves RMSE in testing, might not add a lot of value for many users, since they likely already know about these items. It’s almost like giving the model a cheat sheet of what most users have watched. Precision and recall were also worse, possibly because UBCF allowed the model to identify most popular items through user filtering, and it ended up picking from less popular items than the pure popular model.
For a hybrid recommender, I would pick CBF-diversity. It takes user preferences and builds in different genre recommendations based on their interests. That seems like a better experience than being recommended all Gundam anime because that’s a large percentage of what a user watched or what they generally rated better.
Some possible UX experiments:
Whether providing reasoning for the recommendation could be helpful, e.g., “Because you watched x, you might enjoy y.” This could be evaluated using a/b testing, where some users see recommendations + reasoning, while others just see recommendations without context. Are the users who see the reasoning behind their recommendations more likely to click through?
Provide an option to choose favorite shows, so recommendations can be more aligned with those (Letterboxd does a version of this). To evaluate how well this works, keep 50% of recommendations aligned with favorite shows, 50% algined with other user preferences, and see which users engage with more. Or, pilot on only on a portion of users and evaluate.
Sometimes, people are in the mood for a certain type of show. There could be an interactive element, like asking the user, “what do you feel like watching?” Then, they choose a genre and other specifiers, like whether it’s a movie or series, or how long it is. Then, evaluate whether those recommendations that incorporate additional specifications were more successful than passive ones.
I added popular recommendations to UBCF, but this doesn’t help users stay on top of what’s trending. Those recommendations are all-time, not what’s popular this week, month, etc. This could be implemented via a new/everyone’s watching feature (novelty + popularity) for the week or month, perhaps on the home page or alongside personalized recommendations, and testing how much users clicked on those or started watching compared to their personalized recommendations.
Considering engagement outside of ratings: Measuring whether the user clicked and visited the page for the show, whether they engaged partially (e.g., watched a few episodes and then stopped), or whether they added the show to their watch list. This gives clues as to what the user is interested in, and could be seen as a successful recommendation. Series take a long time to watch, and they may come back to a recommendation later.
Claude Sonnet, 4.6. [Large language model] Accessed June 2026. Claude.ai