library(tidyverse)
library(recommenderlab)

Load the data set

This is part of an anime data set scraped from myanimelist. I used the first million rows of user data and the entire item data set. This is not the full data set.

https://www.kaggle.com/datasets/CooperUnion/anime-recommendations-database?resource=download

#read CSVs
user_ratings <- read.csv('https://raw.githubusercontent.com/samanthabarbaro/data612_recommender_systems/refs/heads/main/animeratings.csv', na.strings = c("", "NA", "null", "NULL"),  header = TRUE)

item_info <- read.csv('https://raw.githubusercontent.com/samanthabarbaro/data612_recommender_systems/refs/heads/main/anime.csv', na.strings = c("", "NA", "null", "NULL"),  header = TRUE)

#number of user ratings
count(user_ratings)
##         n
## 1 1048575
#number of shows/movies
count(item_info)
##       n
## 1 12294
#number of ratings per user: 
#user_ratings |> count(user_id)

#number of users
n_distinct(user_ratings$user_id)
## [1] 10093

We have over 1 million anime ratings by about 10k users.

Create the user-item matrix

If a user marked an anime as watched but didn’t rate it, it’s scored as -1. I am going to filter those out because we are looking at algorithms that predict scores, and this would throw off the baseline significantly. However, it would be good for an implicit ratings set. Anime is a good candidate for implicit ratings - there are movies here, but fans probably wouldn’t mark an entire series as “watched” unless they liked it at least a little.

user_ratings <- user_ratings |>
  filter(rating > -1)

#854k ratings left
count(user_ratings)
##        n
## 1 854113

Join the user ratings matrix with the anime name based on anime ID:

#remove weird text from item info first
item_info$name <- gsub("&quot;", "", item_info$name)
item_info$name <- gsub("amp;", "", item_info$name)

#join
user_item <- left_join(user_ratings, item_info |>
                         select(anime_id, name), by = "anime_id")

#remove the anime ID
user_item <- user_item |>
  select(user_id, rating, name)

More exploratory analysis to see how many ratings users have.

anime_table <- pivot_wider(
  data = user_item,
  names_from = name,
  values_from = rating
) 

rating_counts <- rowSums(!is.na(anime_table |> select(-user_id)))

table(rating_counts)
## rating_counts
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
##  473  300  253  195  186  164  159  150  118  126  136  124   91  102  119  102 
##   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32 
##  109   93   85  105   84   74   89   90   72   78   83   82   66   65   71   70 
##   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47   48 
##   54   68   81   46   57   62   57   54   55   51   60   46   52   71   56   57 
##   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63   64 
##   46   50   36   60   40   49   48   39   49   37   46   42   56   42   48   38 
##   65   66   67   68   69   70   71   72   73   74   75   76   77   78   79   80 
##   37   41   49   41   40   27   37   30   25   35   35   43   39   28   30   23 
##   81   82   83   84   85   86   87   88   89   90   91   92   93   94   95   96 
##   24   31   27   25   28   34   23   31   28   31   30   32   33   32   25   37 
##   97   98   99  100  101  102  103  104  105  106  107  108  109  110  111  112 
##   29   35   24   16   29   22   27   22   23   24   23   28   22   22   36   22 
##  113  114  115  116  117  118  119  120  121  122  123  124  125  126  127  128 
##   23   25   21   17   18   18   18   14   21   19   12   19   17   22   29   22 
##  129  130  131  132  133  134  135  136  137  138  139  140  141  142  143  144 
##   18   20   26   19   24   14   14   20   14   13   20   15    9   11   20    9 
##  145  146  147  148  149  150  151  152  153  154  155  156  157  158  159  160 
##   14   12   15   18   15   16   12   13   10   11   15   19   14    8    7   22 
##  161  162  163  164  165  166  167  168  169  170  171  172  173  174  175  176 
##   18    9   12   11   13    9   12   11   11    8    7   18    8   11   14   10 
##  177  178  179  180  181  182  183  184  185  186  187  188  189  190  191  192 
##    5   10   15   13    5   12   12    9   10    7   12   10   10    9   10    7 
##  193  194  195  196  197  198  199  200  201  202  203  204  205  206  207  208 
##    9    7    5    6    9   10    8    8   10    5    5    5    8    7    8    9 
##  209  210  211  212  213  214  215  216  217  218  219  220  221  222  223  224 
##    8    6    6    5   11    1    9    6   11    2    7    4    6    6    9    8 
##  225  226  227  228  229  230  231  232  233  234  235  236  237  238  239  240 
##    5    5   10    5    5    7    8    6    6    9    5   11    9    7    7   11 
##  241  242  243  244  245  246  247  248  249  250  251  252  253  254  255  256 
##    8    6    1    5    4    5    3   12   10    9    4    5    6    4    9    7 
##  257  258  259  260  261  262  263  264  265  266  267  268  269  270  271  272 
##    3    4    4    6    7    5    2    8   10    8    3    2    2   11    6    2 
##  273  274  275  276  277  278  279  280  281  282  283  284  285  286  287  288 
##    4    5    7    4    5    4    2    4    6    3    6    5    3    5    3    3 
##  289  290  291  292  293  294  295  296  297  298  299  300  301  302  303  304 
##    2    3    2    2    2    4    8    5    3    5    5    1    2    6    5    6 
##  305  306  307  308  309  310  311  312  313  314  315  316  317  318  319  320 
##    1    3    3    4    3    4    3    1    4    4    2    6    3    3    2    1 
##  321  322  323  324  325  326  327  328  329  330  331  332  333  334  335  336 
##    4    3   10    1    6    5    2    8    2    3    4    4    1    5    3    1 
##  337  338  339  340  341  342  343  344  345  346  347  348  349  350  351  352 
##    1    2    2    2    2    3    5    3    4    6    4    5    2    2    1    1 
##  353  354  355  356  357  358  359  360  361  362  363  364  365  367  368  369 
##    5    6    3    1    4    2    2    1    3    2    2    2    2    3    8    2 
##  370  371  372  373  374  375  376  377  378  379  381  382  383  384  385  386 
##    3    2    1    5    1    1    3    2    1    3    2    2    1    2    6    1 
##  387  389  391  392  393  394  395  396  397  398  400  401  402  403  404  405 
##    1    2    2    5    2    1    1    1    1    4    1    1    1    2    2    1 
##  406  407  409  410  412  413  414  415  416  417  418  419  420  422  423  425 
##    4    4    1    5    2    1    1    2    2    3    1    1    2    2    1    1 
##  427  428  429  430  431  432  433  434  435  436  437  438  439  440  441  442 
##    1    3    1    1    3    1    4    1    3    3    1    2    2    1    1    1 
##  443  444  445  446  447  448  449  450  451  453  455  458  459  460  461  462 
##    3    1    2    1    1    2    1    4    1    1    1    1    3    1    1    2 
##  463  465  468  469  471  472  473  474  476  477  482  484  485  486  487  488 
##    1    2    1    1    1    1    2    2    1    2    1    1    1    2    1    1 
##  489  494  495  497  498  499  500  502  504  506  507  511  513  514  516  519 
##    1    1    2    1    1    2    1    1    2    1    2    1    1    3    3    2 
##  521  522  524  525  527  528  529  534  540  541  545  546  547  550  551  555 
##    1    1    1    2    1    1    2    1    1    1    1    1    1    2    1    1 
##  560  562  563  567  568  569  570  571  572  574  575  578  581  582  584  589 
##    1    2    1    2    1    1    2    1    1    1    2    1    1    1    1    1 
##  593  598  599  600  602  607  608  609  610  611  615  616  617  621  622  623 
##    1    1    2    1    1    1    3    2    2    1    1    1    1    1    1    1 
##  624  630  631  636  638  641  650  651  656  657  660  662  663  665  666  670 
##    1    2    1    1    1    1    1    1    2    1    1    1    1    1    2    1 
##  671  672  675  682  685  686  693  695  698  701  707  715  719  735  742  743 
##    1    1    1    1    2    1    1    1    1    1    1    1    1    1    1    1 
##  750  751  753  774  775  781  783  790  795  801  803  807  819  820  821  830 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
##  834  835  837  843  849  854  857  863  877  878  880  903  904  912  940  943 
##    1    1    1    2    1    1    2    2    1    1    1    1    1    1    1    1 
##  951  952  976  989  990  993  998 1001 1011 1012 1013 1019 1032 1040 1041 1042 
##    1    1    1    1    1    1    1    1    1    1    2    1    1    1    1    1 
## 1052 1088 1132 1154 1156 1180 1188 1227 1338 1403 1412 1584 1702 2429 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1
rating_counts_2 <- as_tibble(rating_counts)


#histogram of ratings
rating_counts_2 |> ggplot(aes(x = value)) +
  geom_histogram(bins = 60) +
  labs(title = "Number of Ratings by User", x = "number of ratings") +
  theme_minimal()

#one user has 2429 ratings 
max(rating_counts)
## [1] 2429
#some users only have 1
min(rating_counts)
## [1] 1
#avg user rated 89 shows/movies
mean(rating_counts)
## [1] 89.38912

Pivot wider and make a user-item matrix:

anime_matrix <- pivot_wider(
  data = user_item,
  names_from = name,
  values_from = rating
) |>
  #make user_id the row name
  column_to_rownames("user_id") |> 
  as.matrix()   

#check dimensions
dim(anime_matrix)
## [1] 9555 7942
#create real ratings matrix
real_anime_matrix <- as(anime_matrix, "realRatingMatrix")

I will exclude users with fewer than 15 ratings, which seems reasonable given the range of ratings in this data set:

#min 15 ratings
real_anime_matrix20 <- real_anime_matrix[rowCounts(
  real_anime_matrix) > 15, ]

Creating an item matrix

Here, I turn the genre information about each anime into an item matrix. I do this by breaking the genre column up, getting rid of weird characters that cause matching issues, and pivoting wider. The item matrix will allow me to create a content-based collaborative filtering system.

anime_rows <- separate_longer_delim(item_info, cols = genre, delim = ",")

anime_rows <- anime_rows |> select(name, genre, type)

#clean it up
anime_rows$name <- gsub("amp;", "", anime_rows$name)


item_features <- anime_rows |>
  distinct(name, genre) |>
  mutate(genre_val = 1) |>
  pivot_wider(names_from = genre, values_from = genre_val,
              values_fill = 0, values_fn = max) |>
  column_to_rownames("name")

Creating a recommender: content-based collaborative filtering

Test to find common items - I chose to identify anime by title, not ID, which can get messy when matching two matrices. With this many entries, weird characters are likely to import. Here, I make sure everything that’s in the user matrix is in the item matrix.

common_items <- intersect(rownames(item_features),
                          colnames(real_anime_matrix20))

item_features_aligned <- item_features[common_items, ]

rating_matrix_aligned <- real_anime_matrix20[, common_items]

#check to make sure no items were left out because of weird punctuation
missing <- setdiff(colnames(real_anime_matrix20), rownames(item_features))

missing2 <- setdiff(rownames(item_features), colnames(real_anime_matrix20))

length(missing)
## [1] 0
length(missing2)
## [1] 4350

There are no items that have been rated by users that aren’t in the item matrix, but there are items that have not been rated by users. This makes sense - I am only using a portion of the original data set.

Calculate cosine similarity between items:

library(proxy)
# compute cosine similarity 

item_sim <- as.matrix(simil(as.matrix(item_features), method = "cosine"))
item_sim_matrix <- as.matrix(item_sim)

Predict scores with 15 visible ratings:

cbf_predict <- function(user_ratings, item_sim, n = 30) {
  rated_idx <- which(!is.na(user_ratings))
  unrated_idx <- which(is.na(user_ratings))
  
  scores <- sapply(unrated_idx, function(i) {
    sims <- item_sim[i, rated_idx]
    rats <- user_ratings[rated_idx]
    sum(sims * rats) / (sum(abs(sims)) + 1e-9)
  })
  
  names(scores) <- names(user_ratings)[unrated_idx]
  sort(scores, decreasing = TRUE)[1:n]
}

Evaluate (I had to adjust the number of times this runs and take the user list down to 200 because the user list was too large):

#loop over users
set.seed(1122)
rating_m <- as(rating_matrix_aligned, "matrix")
n_users <- nrow(rating_m)
test_users <- sample(1:n_users, 200)

precision_list <- c()
recall_list <- c()
rmse_list <- c()

for (i in test_users) {
  user_ratings <- rating_m[i, ]
  rated_idx <- which(!is.na(user_ratings))
  if (length(rated_idx) < 5) next

  test_idx <- sample(rated_idx, max(1, floor(0.2 * length(rated_idx))))
  train_ratings <- user_ratings
  train_ratings[test_idx] <- NA

  preds <- cbf_predict(train_ratings, item_sim, n = 30)
  pred_items <- names(preds)

  relevant <- names(user_ratings[test_idx][user_ratings[test_idx] >= 6])
  tp <- length(intersect(pred_items, relevant))
  precision_list[i] <- tp / length(pred_items)
  recall_list[i]    <- ifelse(length(relevant) > 0, tp / length(relevant), NA)

  common <- intersect(pred_items, names(user_ratings[test_idx]))
  if (length(common) > 0) {
    actual <- user_ratings[test_idx][common]
    predicted <- preds[common]
    rmse_list[i] <- sqrt(mean((actual - predicted)^2))
  }
}

#calculate metrics
cat("Precision@30:", mean(precision_list, na.rm = TRUE), "\n")
## Precision@30: 0.002333333
cat("Recall@30:   ", mean(recall_list, na.rm = TRUE), "\n")
## Recall@30:    0.003347666
cat("RMSE:        ", mean(rmse_list, na.rm = TRUE), "\n")
## RMSE:         1.550275

The precision and recall aren’t great, possibly due to the small sample size of just 200 users. RMSE is under 2, and similar to UBCF (this is a 1-10 scale instead of a 1-5).

Creating a recommender - UBCF

UBCF using default settings.

set.seed(1122)
scheme <- evaluationScheme(real_anime_matrix20, 
                           method = "split", 
                           train = 0.8, 
                           #giving the model access to 15 ratings
                           #most users will have a few more than this
                           given = 15, 
                           #ratings are on a 1-10 scale
                           goodRating = 6)

#create recommender
rec1 <- Recommender(getData(scheme, "train"), method = "UBCF")

#predictions - let's see the recommendations for these 6 users

pre <- predict(rec1, real_anime_matrix20[150:155], n = 5)
pre
## Recommendations as 'topNList' with n = 5 for 6 users.
as(pre, "list")
## $`0`
## [1] "Pandora Hearts"       "Kuroshitsuji Special" "Bobobo-bo Bo-bobo"   
## [4] "Nana"                 "Nana Recaps"         
## 
## $`1`
## [1] "Shoujo Kakumei Utena"              "Kino no Tabi: The Beautiful World"
## [3] "Texhnolyze"                        "Touch"                            
## [5] "Gintama"                          
## 
## $`2`
## [1] "Kuroko no Basket 2nd Season" "Kuroko no Basket 3rd Season"
## [3] "Tasogare Otome x Amnesia"    "Golden Time"                
## [5] "Ookami to Koushinryou"      
## 
## $`3`
## [1] "Haikyuu!!: Karasuno Koukou VS Shiratorizawa Gakuen Koukou"
## [2] "Detective Conan Movie 13: The Raven Chaser"               
## [3] "Pokemon: Mewtwo no Gyakushuu"                             
## [4] "Mushishi"                                                 
## [5] "Mushishi Special: Hihamukage"                             
## 
## $`4`
## [1] "Dragon Ball Z"                "Street Fighter II: The Movie"
## [3] "Dragon Ball GT"               "Strait Jacket"               
## [5] "Mimi wo Sumaseba"            
## 
## $`5`
## [1] "Saiunkoku Monogatari"                                
## [2] "Angel Beats!: Another Epilogue"                      
## [3] "Kino no Tabi: The Beautiful World"                   
## [4] "Kino no Tabi: Nanika wo Suru Tame ni - Life Goes On."
## [5] "Planetes"
#check the accuracy 
results <- evaluate(scheme, method = "UBCF", type = "ratings")
## UBCF run fold/sample [model time/prediction time]
##   1  [0.03sec/318.94sec]
avg(results)
##          RMSE      MSE      MAE
## [1,] 1.579522 2.494889 1.212381
results_2 <- evaluate(scheme, 
                    method = "UBCF",
                    type = "topNList", 
                    n = c(5, 10, 20, 30, 40))
## UBCF run fold/sample [model time/prediction time]
##   1  [0.02sec/307.37sec]
avg(results_2)
##             TP        FP       FN       TN       N  precision      recall
## [1,] 0.1967930  4.744898 95.93294 7826.465 7927.34 0.03982301 0.001935160
## [2,] 0.4125364  9.470845 95.71720 7821.739 7927.34 0.04174041 0.003984785
## [3,] 0.8090379 18.957726 95.32070 7812.252 7927.34 0.04092920 0.008059896
## [4,] 1.2871720 28.362974 94.84257 7802.847 7927.34 0.04341200 0.012854015
## [5,] 1.7689504 37.764577 94.36079 7793.445 7927.34 0.04474558 0.017576514
##              TPR          FPR  n
## [1,] 0.001935160 0.0006056703  5
## [2,] 0.003984785 0.0012089195 10
## [3,] 0.008059896 0.0024199557 20
## [4,] 0.012854015 0.0036203765 30
## [5,] 0.017576514 0.0048202719 40

UBCF takes a long time to run due to the size of the data set. The RMSE and MSE are 1.58 and 2.5; RMSE is slightly higher than CBF. Precision and recall both outperform CBF.

In practice, it is a good idea to optimize the neighborhood size, but in this case, I will keep the default settings because the algorithm already takes a long time to run and the purpose of this project is to see how adding other elements, like diversity, affect accuracy.

SVD

set.seed(1122)
results_3 <- evaluate(scheme, method = "SVD", type = "ratings")
## SVD run fold/sample [model time/prediction time]
##   1  [4.33sec/1.87sec]
avg(results_3)
##          RMSE      MSE       MAE
## [1,] 1.303997 1.700408 0.9808316
results_4 <- evaluate(scheme, 
                    method = "SVD",
                    type = "topNList", 
                    n = c(5, 10, 20, 30, 40))
## SVD run fold/sample [model time/prediction time]
##   1  [4.11sec/2.81sec]
#check the results
avg(results_4)
##              TP        FP       FN       TN       N    precision       recall
## [1,] 0.00000000  4.941691 96.12974 7826.268 7927.34 0.0000000000 0.0000000000
## [2,] 0.00000000  9.883382 96.12974 7821.327 7927.34 0.0000000000 0.0000000000
## [3,] 0.00000000 19.766764 96.12974 7811.443 7927.34 0.0000000000 0.0000000000
## [4,] 0.01239067 29.637755 96.11735 7801.572 7927.34 0.0004178958 0.0001653730
## [5,] 0.04373178 39.489796 96.08601 7791.720 7927.34 0.0011061947 0.0003892627
##               TPR          FPR  n
## [1,] 0.0000000000 0.0006312848  5
## [2,] 0.0000000000 0.0012625695 10
## [3,] 0.0000000000 0.0025251390 20
## [4,] 0.0001653730 0.0037861008 30
## [5,] 0.0003892627 0.0050445864 40

RMSE, etc. are actually better than UBCF, but precision and recall are much worse. Let’s find a better k, if possible:

k_to_test <- c(5, 10, 20, 30, 40, 50)

set.seed(1122)
results_by_k <- lapply(k_to_test, function(k) {
  evaluate(scheme,
           method = "SVD",
           type   = "topNList",
           n      = c(5, 10, 20, 30, 40),
           parameter = list(k = k))
})
## SVD run fold/sample [model time/prediction time]
##   1  [2.88sec/2.69sec] 
## SVD run fold/sample [model time/prediction time]
##   1  [4.16sec/2.8sec] 
## SVD run fold/sample [model time/prediction time]
##   1  [6.16sec/2.91sec] 
## SVD run fold/sample [model time/prediction time]
##   1  [8.61sec/2.88sec] 
## SVD run fold/sample [model time/prediction time]
##   1  [10.25sec/3.17sec] 
## SVD run fold/sample [model time/prediction time]
##   1  [13.03sec/3.07sec]
# check results
avg(results_by_k[[1]])
##              TP        FP       FN       TN       N    precision       recall
## [1,] 0.00000000  4.941691 96.12974 7826.268 7927.34 0.0000000000 0.0000000000
## [2,] 0.00000000  9.883382 96.12974 7821.327 7927.34 0.0000000000 0.0000000000
## [3,] 0.00000000 19.766764 96.12974 7811.443 7927.34 0.0000000000 0.0000000000
## [4,] 0.01239067 29.637755 96.11735 7801.572 7927.34 0.0004178958 0.0001653730
## [5,] 0.04154519 39.491983 96.08819 7791.718 7927.34 0.0010508850 0.0003679217
##               TPR          FPR  n
## [1,] 0.0000000000 0.0006312848  5
## [2,] 0.0000000000 0.0012625695 10
## [3,] 0.0000000000 0.0025251390 20
## [4,] 0.0001653730 0.0037861008 30
## [5,] 0.0003679217 0.0050448626 40
# Precision data frame
prec_df_svd <- do.call(rbind, lapply(seq_along(k_to_test), function(i) {
  a <- avg(results_by_k[[i]])
  data.frame(k         = k_to_test[i],
             n         = a[, "n"],
             precision = a[, "precision"])
}))

ggplot(prec_df_svd, aes(x = n, y = precision, color = factor(k), group = factor(k))) +
  geom_line() + geom_point() +
  labs(title = "SVD Precision by Dimensionality (k) and Number of Suggestions",
       x = "Number of Suggestions", y = "Precision",
       color = "Dimensionality (k)") +
  theme_minimal()

# TPR data frame
tpr_df_svd <- do.call(rbind, lapply(seq_along(k_to_test), function(i) {
  a <- avg(results_by_k[[i]])
  data.frame(k   = k_to_test[i],
             n   = a[, "n"],
             TPR = a[, "TPR"])
}))

ggplot(tpr_df_svd, aes(x = n, y = TPR, color = factor(k), group = factor(k))) +
  geom_line() + geom_point() +
  labs(title = "SVD TPR (recall) by Dimensionality (k) and Number of Suggestions",
       x = "Number of Suggestions", y = "TPR (recall)",
       color = "Dimensionality (k)") +
  theme_minimal()

Increasing dimensionality does not result in dramatic improvement. Some of these numbers are very similar, which is why they’re not showing up in the graph (overlap). For TPR, 40 and 50 have the same trajectory, 5 and 10 are similar. 50 dimensions does the best overall, but the next-best is 30, which is close, and takes less long to run.

Tested to see if this was a cold start problem - I tried this with a larger given (35) separately, and it didn’t result in much improvement.

set.seed(1122)
results_5 <- evaluate(scheme, method = "SVD", type = "ratings", 
                    parameter = list(k = 30))
## SVD run fold/sample [model time/prediction time]
##   1  [7.39sec/2.09sec]
avg(results_5)
##          RMSE      MSE       MAE
## [1,] 1.302501 1.696509 0.9794609
results_6 <- evaluate(scheme, 
                    method = "SVD",
                    type = "topNList", 
                    n = c(5, 10, 20, 30, 40), 
                    parameter = list(k = 30))
## SVD run fold/sample [model time/prediction time]
##   1  [8.72sec/3.14sec]
#check the results
avg(results_6)
##               TP        FP       FN       TN       N    precision       recall
## [1,] 0.000000000  4.941691 96.12974 7826.268 7927.34 0.000000e+00 0.000000e+00
## [2,] 0.000000000  9.883382 96.12974 7821.327 7927.34 0.000000e+00 0.000000e+00
## [3,] 0.000728863 19.766035 96.12901 7811.444 7927.34 3.687316e-05 3.816037e-06
## [4,] 0.013119534 29.637026 96.11662 7801.573 7927.34 4.424779e-04 1.691890e-04
## [5,] 0.046647230 39.486880 96.08309 7791.723 7927.34 1.179941e-03 4.205039e-04
##               TPR          FPR  n
## [1,] 0.000000e+00 0.0006312848  5
## [2,] 0.000000e+00 0.0012625695 10
## [3,] 3.816037e-06 0.0025250445 20
## [4,] 1.691890e-04 0.0037860062 30
## [5,] 4.205039e-04 0.0050442072 40

Random: establishing a floor

rec4 <- Recommender(getData(scheme, "train"), method = "RANDOM")

results_r <- evaluate(scheme, method = "RANDOM", type = "ratings")
## RANDOM run fold/sample [model time/prediction time]
##   1  [0.01sec/1sec]
avg(results_r)
##          RMSE      MSE      MAE
## [1,] 3.803858 14.46934 3.110573
results_r2 <- evaluate(scheme, 
                    method = "RANDOM",
                    type = "topNList", 
                    n = c(5, 10, 20, 30, 40))
## RANDOM run fold/sample [model time/prediction time]
##   1  [0.01sec/2.42sec]
avg(results_r2)
##              TP        FP       FN       TN       N  precision       recall
## [1,] 0.06413994  4.935860 96.06560 7826.274 7927.34 0.01282799 0.0005932556
## [2,] 0.12609329  9.873907 96.00364 7821.336 7927.34 0.01260933 0.0013445853
## [3,] 0.25437318 19.745627 95.87536 7811.464 7927.34 0.01271866 0.0026898599
## [4,] 0.38192420 29.618076 95.74781 7801.592 7927.34 0.01273081 0.0039833604
## [5,] 0.49052478 39.509475 95.63921 7791.700 7927.34 0.01226312 0.0049503555
##               TPR          FPR  n
## [1,] 0.0005932556 0.0006302935  5
## [2,] 0.0013445853 0.0012609300 10
## [3,] 0.0026898599 0.0025215154 20
## [4,] 0.0039833604 0.0037821578 30
## [5,] 0.0049503555 0.0050452918 40
#not displaying the recs for 6 users because they won't tell us anything

RMSE is really high at ~3.8 (higher than everything else), precision and recall are better than SVD, which is concerning for SVD.

Comparing results

Looking at UBCF, SVD, popular, and Random for the first comparison. I will add CBF’s more limited metrics after.

extract_avg <- function(result, method_name) {
  as.data.frame(avg(result)) |>
    mutate(Method = method_name)
}


combined_df <- bind_rows(
  extract_avg(results_2,       "UBCF"),   
  extract_avg(results_6,               "SVD (k=30)"),
    extract_avg(results_p2,              "Popular"),
  extract_avg(results_r2,              "Random")
)

combined_df
##             TP        FP       FN       TN       N    precision       recall
## 1  0.196793003  4.744898 95.93294 7826.465 7927.34 3.982301e-02 1.935160e-03
## 2  0.412536443  9.470845 95.71720 7821.739 7927.34 4.174041e-02 3.984785e-03
## 3  0.809037901 18.957726 95.32070 7812.252 7927.34 4.092920e-02 8.059896e-03
## 4  1.287172012 28.362974 94.84257 7802.847 7927.34 4.341200e-02 1.285402e-02
## 5  1.768950437 37.764577 94.36079 7793.445 7927.34 4.474558e-02 1.757651e-02
## 6  0.000000000  4.941691 96.12974 7826.268 7927.34 0.000000e+00 0.000000e+00
## 7  0.000000000  9.883382 96.12974 7821.327 7927.34 0.000000e+00 0.000000e+00
## 8  0.000728863 19.766035 96.12901 7811.444 7927.34 3.687316e-05 3.816037e-06
## 9  0.013119534 29.637026 96.11662 7801.573 7927.34 4.424779e-04 1.691890e-04
## 10 0.046647230 39.486880 96.08309 7791.723 7927.34 1.179941e-03 4.205039e-04
## 11 1.943877551  2.997813 94.18586 7828.212 7927.34 3.933628e-01 3.146229e-02
## 12 3.414723032  6.468659 92.71501 7824.741 7927.34 3.455015e-01 5.522792e-02
## 13 6.124635569 13.642128 90.00510 7817.568 7927.34 3.098451e-01 9.301004e-02
## 14 7.968658892 21.681487 88.16108 7809.528 7927.34 2.687561e-01 1.167602e-01
## 15 9.844752187 29.688776 86.28499 7801.521 7927.34 2.490229e-01 1.422973e-01
## 16 0.064139942  4.935860 96.06560 7826.274 7927.34 1.282799e-02 5.932556e-04
## 17 0.126093294  9.873907 96.00364 7821.336 7927.34 1.260933e-02 1.344585e-03
## 18 0.254373178 19.745627 95.87536 7811.464 7927.34 1.271866e-02 2.689860e-03
## 19 0.381924198 29.618076 95.74781 7801.592 7927.34 1.273081e-02 3.983360e-03
## 20 0.490524781 39.509475 95.63921 7791.700 7927.34 1.226312e-02 4.950356e-03
##             TPR          FPR  n     Method
## 1  1.935160e-03 0.0006056703  5       UBCF
## 2  3.984785e-03 0.0012089195 10       UBCF
## 3  8.059896e-03 0.0024199557 20       UBCF
## 4  1.285402e-02 0.0036203765 30       UBCF
## 5  1.757651e-02 0.0048202719 40       UBCF
## 6  0.000000e+00 0.0006312848  5 SVD (k=30)
## 7  0.000000e+00 0.0012625695 10 SVD (k=30)
## 8  3.816037e-06 0.0025250445 20 SVD (k=30)
## 9  1.691890e-04 0.0037860062 30 SVD (k=30)
## 10 4.205039e-04 0.0050442072 40 SVD (k=30)
## 11 3.146229e-02 0.0003815112  5    Popular
## 12 5.522792e-02 0.0008237830 10    Popular
## 13 9.301004e-02 0.0017372958 20    Popular
## 14 1.167602e-01 0.0027618490 30    Popular
## 15 1.422973e-01 0.0037824240 40    Popular
## 16 5.932556e-04 0.0006302935  5     Random
## 17 1.344585e-03 0.0012609300 10     Random
## 18 2.689860e-03 0.0025215154 20     Random
## 19 3.983360e-03 0.0037821578 30     Random
## 20 4.950356e-03 0.0050452918 40     Random
#Comparison at 30 rec's

combined_df |>
  filter(n == 30) |>
  select(Method, precision, recall, TPR, FPR) |>
  knitr::kable(digits = 4, caption = "Comparison at 30 Recommendations")
Comparison at 30 Recommendations
Method precision recall TPR FPR
UBCF 0.0434 0.0129 0.0129 0.0036
SVD (k=30) 0.0004 0.0002 0.0002 0.0038
Popular 0.2688 0.1168 0.1168 0.0028
Random 0.0127 0.0040 0.0040 0.0038
#bar chart
combined_df |>
  filter(n == 30) |>
  select(Method, precision, recall) |>
  pivot_longer(c(precision, recall), names_to = "Metric", values_to = "Value") |>
  ggplot(aes(x = Method, y = Value, fill = Metric)) +
  geom_col(position = "dodge") +
  labs(title = "Precision & Recall at N=30", x = NULL, y = NULL) +
  theme_minimal()

SVD is so low it doesn’t even show up on the bar chart. UBCF does significantly better than random, but popular is still a high baseline.

#Precision

ggplot(combined_df, aes(x = n, y = precision, color = Method)) +
  geom_line(linewidth = 1) + geom_point() +
  labs(title = "Precision by Method",
       x = "Number of Recommendations", y = "Precision") +
  theme_minimal()

#Reall/true positive rate

ggplot(combined_df, aes(x = n, y = TPR, color = Method)) +
  geom_line(linewidth = 1) + geom_point() +
  labs(title = "Recall (TPR) by Method",
       x = "Number of Recommendations", y = "Recall (TPR)") +
  theme_minimal()

#ROC curve

ggplot(combined_df, aes(x = FPR, y = TPR, color = Method)) +
  geom_line(linewidth = 1) + geom_point(shape = 3) +
  labs(title = "ROC Curve by Method",
       x = "FPR", y = "TPR") +
  theme_minimal()

Based on the ROC curve, SVD does worse than random, possibly because the data set is so sparse, and there are only a few false positives and true positives overall. With a large rating scale, 1-10, SVD may also struggle to normalize.

Popular does better in every case, but UBCF improves with more recommendations and is better than random.

Adding in collaborative filtering (there’s only one k, 30, for this model):

cbf_row <- data.frame(
  Method = "CBF",
  n = 30,
  precision = mean(precision_list, na.rm = TRUE),
  recall = mean(recall_list, na.rm = TRUE)
)

combined_df2 <- bind_rows(
  extract_avg(results_2,  "UBCF"),
  extract_avg(results_6,  "SVD (k=30)"),
  extract_avg(results_p2, "Popular"),
  extract_avg(results_r2, "Random")
        ) |>
  filter(n == 30) |>
  select(Method, n, precision, recall) |>
  bind_rows(cbf_row)

#bar chart

combined_df2 |>
  filter(n == 30) |>
  select(Method, precision, recall) |>
  mutate(Method = reorder(Method, -precision)) |>
  pivot_longer(c(precision, recall), names_to = "Metric", values_to = "Value") |>
  ggplot(aes(x = Method, y = Value, fill = Metric)) +
  geom_col(position = "dodge") +
  labs(title = "Precision & Recall at 30 Suggestions", x = NULL, y = NULL) +
  theme_minimal()

CBF’s precision and recall were worse than everything but SVD. However, that’s not true for RMSE:

rmse_df <- data.frame(
  Algorithm = c("UBCF", "SVD", "Popular", "Random", "CBF"),
  RMSE = c(
    avg(results)[,"RMSE"],
    avg(results_5)[,"RMSE"],
    avg(results_p)[,"RMSE"],
    avg(results_r)[,"RMSE"],
    mean(rmse_list, na.rm = TRUE)
  )
)

rmse_df
##   Algorithm     RMSE
## 1      UBCF 1.579522
## 2       SVD 1.302501
## 3   Popular 1.283353
## 4    Random 3.803858
## 5       CBF 1.550275
rmse_df |> arrange(desc(RMSE)) |>
  ggplot(aes(x = reorder(Algorithm, -RMSE), y = RMSE, fill = Algorithm)) +
  geom_col() + 
  labs(title = "Algorithm by RMSE (at 30 Recommendations)",
       x = "Algorithm") +
  theme_minimal() +
  theme(legend.position = "none")

CF is in the middle for RMSE. Precision and recall might not be great, but the model is able to predict scores better than UBCF. SVD and popular (popular items score well) are close, while UBCF is slightly worse, and random is very high. SVD’s RMSE comes close, so, again, while the model is bad at picking true positives (choosing items the user has watched) it’s good at making accurate predictions on the ones it does choose. It may be better at helping users discover items they haven’t explored yet.

However, because SVD’s precision and recall were so low, I’m only going to modify UBCF and CBF to include other types of recommendations.

Adding recommendation diversity

I’ll add recommendation diversity in two ways . First, I’m adding a random element to the UBCF recommender (20% random), which should give the results variety but make the metrics slightly worse. I’m using a smaller set of 800 users, since the hybrid recommender took a long time to run (~20 minutes).

UBCF + Random

# sample 800 users
set.seed(1122)
user_sample <- sample(1:nrow(real_anime_matrix20), 800)
small_matrix <- real_anime_matrix20[user_sample, ]

# train/test split on 800 users
scheme_small <- evaluationScheme(small_matrix,
                                 method = "split",
                                 train = 0.8,
                                 given = 15,
                                 goodRating = 6)

test_data    <- getData(scheme_small, "known")
test_unknown <- getData(scheme_small, "unknown")

#retrain models on small scheme
rec1_s <- Recommender(getData(scheme_small,
                              "train"), 
                      method = "UBCF")


rec4_s <- Recommender(getData(scheme_small,
                              "train"),
                      method = "RANDOM")

#adding 20% random suggestions for diversity
diversity <- HybridRecommender(
  rec1_s, #UBCF
  rec4_s, #random
  weights = c(0.8, 0.2)
)
preds_div <- predict(diversity, test_data, type = "ratings")

#check accuracy (RMSE MAE MSE)
acc <- calcPredictionAccuracy(preds_div, test_unknown)

acc
##     RMSE      MSE      MAE 
## 2.720935 7.403486 2.077912
preds_topn <- predict(diversity, 
                      test_data,
                      type = "topNList",
                      n = 30)


#check true positives, etc.
acc_topn <- calcPredictionAccuracy(preds_topn,
                                   test_unknown,
                                   given = 15,
                                   goodRating = 6)

acc_topn
##           TP           FP           FN           TN            N    precision 
## 3.687500e-01 2.963125e+01 1.103375e+02 7.786663e+03 7.927000e+03 1.229167e-02 
##       recall          TPR          FPR 
## 3.228892e-03 3.228892e-03 3.790708e-03

Adding random recommendations made precision, recall, and RMSE a lot worse, predictably (this may also have to do with the smaller user set, but I was able to run this once on the original data set and the result was similar).

Combining CBF with diversity

This uses Maximal Marginal Relevance, which re-ranks predictions based on similarity to items it already predicted. It picks the top n (30) items, but one by one, balancing the list based on similarity to previously selected itms. A lambda of .7 means 70% relevance, 30% avoiding redundant/similar choices.

# use cosine similarity previously calculated (no need to calculate it again)

cbf_predict_mmr <- function(user_ratings, item_sim, n = 30, lambda = 0.7) {
  rated_idx <- which(!is.na(user_ratings))
  unrated_idx <- which(is.na(user_ratings))
  
  #base relevance scores
  scores <- sapply(unrated_idx, function(i) {
    sims <- item_sim[i, rated_idx]
    rats <- user_ratings[rated_idx]
    sum(sims * rats) / (sum(abs(sims)) + 1e-9)
  })
  
  #MMR reranking
  n_candidates <- length(unrated_idx)
  selected_pos <- c()
  remaining <- seq_len(n_candidates)
  
  for (k in 1:n) {
    if (length(selected_pos) == 0) {
      best_pos <- which.max(scores[remaining])
    } else {
      redundancy <- apply(item_sim[unrated_idx[remaining], unrated_idx[selected_pos], drop = FALSE], 1, max)
      mmr <- lambda * scores[remaining] - (1 - lambda) * redundancy
      best_pos <- which.max(mmr)
    }
    selected_pos <- c(selected_pos, remaining[best_pos])
    remaining <- remaining[-best_pos]
  }
  
  result <- scores[selected_pos]
  names(result) <- names(user_ratings)[unrated_idx[selected_pos]]
  result
}


set.seed(1122)
test_users <- sample(1:n_users, 200)
precision_list_cd <- c()
recall_list_cd <- c()
rmse_list_cd <- c()

for (i in test_users) {
  user_ratings <- rating_m[i, ]
  rated_idx <- which(!is.na(user_ratings))
  if (length(rated_idx) < 5) next
  test_idx <- sample(rated_idx, max(1, floor(0.2 * length(rated_idx))))
  train_ratings <- user_ratings
  train_ratings[test_idx] <- NA
  preds <- cbf_predict_mmr(train_ratings, item_sim, n = 30, lambda = 0.7)
  pred_items <- names(preds)
  relevant <- names(user_ratings[test_idx][user_ratings[test_idx] >= 6])
  tp <- length(intersect(pred_items, relevant))
  precision_list_cd[i] <- tp / length(pred_items)
  recall_list_cd[i]    <- ifelse(length(relevant) > 0, tp / length(relevant), NA)
  common <- intersect(pred_items, names(user_ratings[test_idx]))
  if (length(common) > 0) {
    actual <- user_ratings[test_idx][common]
    predicted <- preds[common]
    rmse_list_cd[i] <- sqrt(mean((actual - predicted)^2))
  }
}

cat("Precision@30:", mean(precision_list_cd, na.rm = TRUE), "\n")
## Precision@30: 0.003333333
cat("Recall@30:   ", mean(recall_list_cd, na.rm = TRUE), "\n")
## Recall@30:    0.006107851
cat("RMSE:        ", mean(rmse_list_cd, na.rm = TRUE), "\n")
## RMSE:         1.420619

RMSE is actually a little better than the original model, as are precision and recall. This could be because alternating between genres give the model the option make predictions from different pools. So, in a situation where each genre would have a bunch of top options that many people have seen, diversifying predictions may make it easier to predict ratings.

Comparing results with novelty and diversity

rmse_df_hybrid <- data.frame(
  Algorithm = c("UBCF", "SVD", "Popular", "Random", 
                "CBF", "UBCF-Random", 
                "UBCF-Popular", "CBF Diversity"),
  RMSE = c(
    avg(results)[,"RMSE"],
    avg(results_5)[,"RMSE"],
    avg(results_p)[,"RMSE"],
    avg(results_r)[,"RMSE"],
    mean(rmse_list, na.rm = TRUE),
    acc["RMSE"],
    acc2["RMSE"],
    mean(rmse_list_cd, na.rm = TRUE)
  )
)

rmse_df_hybrid
##       Algorithm     RMSE
## 1          UBCF 1.579522
## 2           SVD 1.302501
## 3       Popular 1.283353
## 4        Random 3.803858
## 5           CBF 1.550275
## 6   UBCF-Random 2.720935
## 7  UBCF-Popular 1.400226
## 8 CBF Diversity 1.420619
rmse_df_hybrid |> arrange(desc(RMSE)) |>
  ggplot(aes(x = reorder(Algorithm, -RMSE), y = RMSE, fill = Algorithm)) +
  geom_col() + 
  labs(title = "Algorithm by RMSE (at 30 Recommendations)",
       x = "Algorithm") +
  theme_minimal() + 
  theme(legend.position = "none")

Popular still has the lowest RMSE, but UBCF-Popular comes close, and CBF with diversity performs better than the original CBF algorithm.

cbf_row2 <- data.frame(
  Method = "CBF Diversity",
  n = 30,
  precision = mean(precision_list_cd, na.rm = TRUE),
  recall = mean(recall_list_cd, na.rm = TRUE)
)

#add cbf and make data frames that can be added to the table

hybrid1_row <- data.frame(
  Method    = "UBCF-Random",
  n         = 30,
  precision = acc_topn["precision"],
  recall    = acc_topn["recall"]
)

hybrid2_row <- data.frame(
  Method    = "UBCF-Popular",
  n         = 30,
  precision = acc_topn2["precision"],
  recall    = acc_topn2["recall"]
)

combined_df3 <- combined_df2 |>
  bind_rows(hybrid1_row, hybrid2_row) |>
  select(Method, n, precision, recall) |>
  bind_rows(cbf_row2)

combined_df3
##                      Method  n    precision       recall
## ...1                   UBCF 30 0.0434119961 0.0128540152
## ...2             SVD (k=30) 30 0.0004424779 0.0001691890
## ...3                Popular 30 0.2687561455 0.1167601506
## ...4                 Random 30 0.0127308066 0.0039833604
## ...5                    CBF 30 0.0023333333 0.0033476662
## precision...6   UBCF-Random 30 0.0122916667 0.0032288916
## precision...7  UBCF-Popular 30 0.0026315789 0.0003707726
## ...8          CBF Diversity 30 0.0033333333 0.0061078514
#bar chart

combined_df3 |>
  filter(n == 30) |>
  select(Method, precision, recall) |>
  mutate(Method = reorder(Method, -precision)) |>
  pivot_longer(c(precision, recall), names_to = "Metric", values_to = "Value") |>
  ggplot(aes(x = Method, y = Value, fill = Metric)) +
  geom_col(position = "dodge") +
  labs(title = "Precision & Recall at 30 Suggestions", x = NULL, y = NULL) +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 45))

For both UBCF hybrid models, precision and recall are significantly worse. For CBF as a hybrid recommender, recall improved, but not precision.

Conclusion

Having CBF predict a more diverse set of items resulted in a lower RMSE. Recall and precision also improved slightly, which suggests that diversifying genre recommendations allows the model to find more of the relevant cases. This test was only run on a pool of 200 users, so, it’s possible that if run on the entire data set, it could perform even better. CBF + diversity also incorporates a more sophisticated model than mixing UBCF with random or popular recommendations.

As predicted, adding random recommendations to the UBCF algorithm made the RMSE, precision, and recall worse. Having a bunch of random suggestions thrown in probably does not improve the user experience for most people. Recommending popular items, while it improves RMSE in testing, might not add a lot of value for many users, since they likely already know about these items. It’s almost like giving the model a cheat sheet of what most users have watched. Precision and recall were also worse, possibly because UBCF allowed the model to identify most popular items through user filtering, and it ended up picking from less popular items than the pure popular model.

For a hybrid recommender, I would pick CBF-diversity. It takes user preferences and builds in different genre recommendations based on their interests. That seems like a better experience than being recommended all Gundam anime because that’s a large percentage of what a user watched or what they generally rated better.

UX tests

Some possible UX experiments:

Whether providing reasoning for the recommendation could be helpful, e.g., “Because you watched x, you might enjoy y.” This could be evaluated using a/b testing, where some users see recommendations + reasoning, while others just see recommendations without context. Are the users who see the reasoning behind their recommendations more likely to click through?

Provide an option to choose favorite shows, so recommendations can be more aligned with those (Letterboxd does a version of this). To evaluate how well this works, keep 50% of recommendations aligned with favorite shows, 50% algined with other user preferences, and see which users engage with more. Or, pilot on only on a portion of users and evaluate.

Sometimes, people are in the mood for a certain type of show. There could be an interactive element, like asking the user, “what do you feel like watching?” Then, they choose a genre and other specifiers, like whether it’s a movie or series, or how long it is. Then, evaluate whether those recommendations that incorporate additional specifications were more successful than passive ones.

I added popular recommendations to UBCF, but this doesn’t help users stay on top of what’s trending. Those recommendations are all-time, not what’s popular this week, month, etc. This could be implemented via a new/everyone’s watching feature (novelty + popularity) for the week or month, perhaps on the home page or alongside personalized recommendations, and testing how much users clicked on those or started watching compared to their personalized recommendations.

Considering engagement outside of ratings: Measuring whether the user clicked and visited the page for the show, whether they engaged partially (e.g., watched a few episodes and then stopped), or whether they added the show to their watch list. This gives clues as to what the user is interested in, and could be seen as a successful recommendation. Series take a long time to watch, and they may come back to a recommendation later.

Sources

Claude Sonnet, 4.6. [Large language model] Accessed June 2026. Claude.ai