Project 2

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(recommenderlab)

## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Loading required package: arules
## 
## Attaching package: 'arules'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
## 
## Loading required package: proxy
## 
## Attaching package: 'proxy'
## 
## The following object is masked from 'package:Matrix':
## 
##     as.matrix
## 
## The following objects are masked from 'package:stats':
## 
##     as.dist, dist
## 
## The following object is masked from 'package:base':
## 
##     as.matrix

#load MovieLensMeta
data("MovieLense")

#this df contains item information (boolean yes or no whether something has
#an attribute)
glimpse(MovieLenseMeta)

## Rows: 1,664
## Columns: 22
## $ title        <chr> "Toy Story (1995)", "GoldenEye (1995)", "Four Rooms (1995…
## $ year         <dbl> 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 199…
## $ url          <chr> "http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)", …
## $ unknown      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Action       <int> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ Adventure    <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Animation    <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ `Children's` <int> 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Comedy       <int> 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, …
## $ Crime        <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, …
## $ Documentary  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Drama        <int> 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, …
## $ Fantasy      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ `Film-Noir`  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Horror       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ Musical      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Mystery      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Romance      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, …
## $ `Sci-Fi`     <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Thriller     <int> 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, …
## $ War          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Western      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

#user information (age, occupatoin, etc.)
glimpse(MovieLenseUser)

## Rows: 943
## Columns: 5
## $ id         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ age        <int> 24, 53, 23, 24, 33, 42, 57, 36, 29, 53, 39, 28, 47, 45, 49,…
## $ sex        <fct> M, F, M, M, F, M, M, M, M, M, F, F, M, M, F, M, M, F, M, F,…
## $ occupation <fct> technician, other, writer, technician, other, executive, ad…
## $ zipcode    <fct> 85711, 94043, 32067, 43537, 15213, 98101, 91344, 05201, 010…

#the main data set

MovieLense

## 943 x 1664 rating matrix of class 'realRatingMatrix' with 99392 ratings.

Exploratory analysis

#convert the matrix into a data frame (long) for exploratory analysis 

ratings_df <- as(MovieLense, "data.frame")

ratings_df_wide <- as.data.frame(as(MovieLense, "matrix"))

Most-rated and top-rated movies

What movies have the most ratings? What movies are the best-rated? How many movies are represented overall?

top_ratings <- ratings_df |> 
  group_by(item) |>
  summarise(avg_rating = mean(rating, na.rm = TRUE)) |>
  arrange(desc(avg_rating))

#some of these movies have probably only been rated once
top_ratings

## # A tibble: 1,664 × 2
##    item                                                avg_rating
##    <chr>                                                    <dbl>
##  1 "Aiqing wansui (1994)"                                       5
##  2 "Entertaining Angels: The Dorothy Day Story (1996)"          5
##  3 "Great Day in Harlem, A (1994)"                              5
##  4 "Marlene Dietrich: Shadow and Light (1996) "                 5
##  5 "Prefontaine (1997)"                                         5
##  6 "Saint of Fort Washington, The (1993)"                       5
##  7 "Santa with Muscles (1996)"                                  5
##  8 "Someone Else's America (1995)"                              5
##  9 "Star Kid (1997)"                                            5
## 10 "They Made Me a Criminal (1939)"                             5
## # ℹ 1,654 more rows

most_ratings <- ratings_df |>
  group_by(item) |>
  summarise(count_rating = n(),
            mean_rating  = mean(rating, na.rm = TRUE)) |>
  arrange(desc(count_rating))

#the original star wars is the most-rated movie
most_ratings

## # A tibble: 1,664 × 3
##    item                          count_rating mean_rating
##    <chr>                                <int>       <dbl>
##  1 Star Wars (1977)                       583        4.36
##  2 Contact (1997)                         509        3.80
##  3 Fargo (1996)                           508        4.16
##  4 Return of the Jedi (1983)              507        4.01
##  5 Liar Liar (1997)                       485        3.16
##  6 English Patient, The (1996)            481        3.66
##  7 Scream (1996)                          478        3.44
##  8 Toy Story (1995)                       452        3.88
##  9 Air Force One (1997)                   431        3.63
## 10 Independence Day (ID4) (1996)          429        3.44
## # ℹ 1,654 more rows

highest_rated <- most_ratings |>
  filter(count_rating > 50) |>
  arrange(desc(mean_rating))

#top-rated movies with more than 50 reviews
head(highest_rated)

## # A tibble: 6 × 3
##   item                                                  count_rating mean_rating
##   <chr>                                                        <int>       <dbl>
## 1 Close Shave, A (1995)                                          112        4.49
## 2 Schindler's List (1993)                                        298        4.47
## 3 Wrong Trousers, The (1993)                                     118        4.47
## 4 Casablanca (1942)                                              243        4.46
## 5 Wallace & Gromit: The Best of Aardman Animation (199…           67        4.45
## 6 Shawshank Redemption, The (1994)                               283        4.45

#1664 unique movies are represented
count(most_ratings)

## # A tibble: 1 × 1
##       n
##   <int>
## 1  1664

#how many moview have fewer than 5 ratings?
most_ratings |> filter(count_rating < 5) |>
  arrange(count_rating)

## # A tibble: 324 × 3
##    item                               count_rating mean_rating
##    <chr>                                     <int>       <dbl>
##  1 A koldum klaka (Cold Fever) (1994)            1           3
##  2 Aiqing wansui (1994)                          1           5
##  3 All Things Fair (1996)                        1           3
##  4 Angel on My Shoulder (1946)                   1           2
##  5 Angela (1995)                                 1           3
##  6 August (1996)                                 1           1
##  7 B. Monkey (1998)                              1           3
##  8 Baton Rouge (1988)                            1           1
##  9 Big Bang Theory, The (1994)                   1           4
## 10 Big One, The (1997)                           1           3
## # ℹ 314 more rows

#324 movies

#how many have only one? 136 or about 8 percent.
most_ratings |> filter(count_rating < 2) |>
  arrange(count_rating)

## # A tibble: 136 × 3
##    item                               count_rating mean_rating
##    <chr>                                     <int>       <dbl>
##  1 A koldum klaka (Cold Fever) (1994)            1           3
##  2 Aiqing wansui (1994)                          1           5
##  3 All Things Fair (1996)                        1           3
##  4 Angel on My Shoulder (1946)                   1           2
##  5 Angela (1995)                                 1           3
##  6 August (1996)                                 1           1
##  7 B. Monkey (1998)                              1           3
##  8 Baton Rouge (1988)                            1           1
##  9 Big Bang Theory, The (1994)                   1           4
## 10 Big One, The (1997)                           1           3
## # ℹ 126 more rows

#what is the mean rating? About 3.53
ratings_df |> summarise(mean(rating, na.rm = TRUE))

##   mean(rating, na.rm = TRUE)
## 1                   3.529982

Number of ratings by user

Looking at the number of users represented and a histogram of reviews per user:

rating_counts <- rowSums(!is.na(ratings_df_wide))

rating_counts <- as_tibble(rating_counts)

#average number of ratings per user = 105
rating_counts |> summarize(mean(value))

## # A tibble: 1 × 1
##   `mean(value)`
##           <dbl>
## 1          105.

#max is 735, so at least one user rated 735 films
rating_counts |> summarize(max(value))

## # A tibble: 1 × 1
##   `max(value)`
##          <dbl>
## 1          735

rating_counts |> ggplot(aes(x = value)) +
  geom_histogram(bins = 40) +
  labs(title = "Number of Ratings by User", x = "number of ratings") +
  theme_minimal()

The distribution is left-skewed, predictably (you can’t review fewer than 0 movies). There are a some outliers (users who reviewed over 600 movies). A large number of users reviewed fewer than 80 movies.

How many users posted zero ratings?

#how many users rated at least one movie?
rating_counts |> count(value != 0)

## # A tibble: 1 × 2
##   `value != 0`     n
##   <lgl>        <int>
## 1 TRUE           943

#how many raters total
count(rating_counts)

## # A tibble: 1 × 1
##       n
##   <int>
## 1   943

#minimum value
rating_counts |> summarize(min(value))

## # A tibble: 1 × 1
##   `min(value)`
##          <dbl>
## 1           19

The data set only includes users who rated at least 19 movies (no one rated zero movies).

Evaluating different recommendations systems

Minimum number of ratings

I will start with users with only a certain number of ratings: 40. This isn’t too far off from the minimum, 19, but allows us to let the recommender “see” 35 ratings for each user.

With 633 users left, we can split the data into 506 (training) and 127 (test) 80/20.

MovieLense40 <- MovieLense[rowCounts(MovieLense) > 40, ]

#how many users do we have left?
MovieLense40

## 633 x 1664 rating matrix of class 'realRatingMatrix' with 90953 ratings.

Normalizing the ratings

Normalizing the data - there are two ways in recommenderlab: Z-score and centering. When passed through Recommender, the data is normalized automatically, and it’s centered by default.

Centering makes the average 0. It keeps the original score of the numbers. This is better for critical vs non-critical reviewers.
Z-score divides by the standard deviation.

https://www.rdocumentation.org/packages/recommenderlab/versions/1.0.6/topics/normalize

You can also binarize the data (make it a 0 or 1 for example, depending on whether it’s 3 or higher)

Let’s visualize normalized ratings.

#graphing what normalized ratings look like

#raw ratings
image(MovieLense40[1:10, 1:10], main = "Uncentered Ratings")

#centered (normalized)
image(normalize(MovieLense40[1:10, 1:10], method = "center"), 
      main = "Normalized Ratings (centered)")

#Z-score normalized
image(normalize(MovieLense40[1:10, 1:10], method = "Z-score"), 
      main = "Normalized Ratings (Z-sore)")

image(MovieLense40, main = "Uncentered Ratings (all)")

With just a slice of the data, we can see how different the scores end up looking if we normalize with Z-score vs centering. The last table gives a good idea of how sparsely the data set is populated.

User-based collaborative filtering

This algorithm looks at a neighborhood of similar users to generate a set of recommendations for a user. It assumes that users who liked the same items previously will continue to like the same movies.

Before using the evaluationscheme method below (part of the recommender package), I tried splitting the test/training data manually, using the same method as in the documentation, and then using predict. This produced zero false positives, likely due to the number of blank ratings. I then used the evaluationscheme function, which allows you to split the data.

Note on methodology and organization

There are a lot of variables for each algorithm: the train/test split, given, whether to use k-fold cross-validation or a train/test split, how to normalize the data, how many neighbors (items in ibcf and users in ubcf), what to consider a good rating. I’ve designated 4 as a good rating throughout, whch seems like a general best practice for a 1-5 scale. Where relevant, I’ve kept the train/test split at 80/20. In general, I’ve tried different givens (from cold-ish start onward) and tested it on centered vs. z-score normalization, then tried it with k-fold cross-validation to see if the results differ. There are also precision-recall graphs that compare IBCF and UBCF to simple recommending a random or popular item to a user.

I’ve implemented (created) a recommendation algorithm in most places, and I’ve looked at the suggestions with “predict.” However, I’ve mostly evaluated with “evaluate,” since that’s what has allowed me to judge how well the tweaks I made to each system are working. Recommenderlab allows you to see TPs and FPs for recommenders, but, as I mention above, I ended up with mostly false positives becuase this is a large data set (though these do factor into my ultimate recommendation). So I’ve leaned toward the RMSE, MAE, and MSE to judge how well the different algorithms perform. MSE and RMSE are both sensitive to outliers, while MAE isn’t.

#splitting - thie is a VERY cold start scenario (10 ratings are given out of 40)
set.seed(1122)
scheme <- evaluationScheme(MovieLense40, 
                           method = "split", 
                           #percent of users who comprise the test set
                           train = 0.8, 
                           #number of ratings 
                           given = 10, 
                           goodRating = 4)

scheme

## Evaluation scheme with 10 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.800
## Good ratings: >=4.000000
## Data set: 633 x 1664 rating matrix of class 'realRatingMatrix' with 90953 ratings.

rec1 <- Recommender(getData(scheme, "train"), method = "UBCF")

#predictions - let's see the recommendations for these 6 users

pre <- predict(rec1, MovieLense40[150:155], n = 5)
pre

## Recommendations as 'topNList' with n = 5 for 6 users.

as(pre, "list")

## $`0`
## [1] "Anne Frank Remembered (1995)"                        
## [2] "Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)"
## [3] "Alphaville (1965)"                                   
## [4] "Close Shave, A (1995)"                               
## [5] "Shall We Dance? (1937)"                              
## 
## $`1`
## [1] "Wild Bunch, The (1969)"           "Last of the Mohicans, The (1992)"
## [3] "Jungle Book, The (1994)"          "Deer Hunter, The (1978)"         
## [5] "39 Steps, The (1935)"            
## 
## $`2`
## [1] "Heavenly Creatures (1994)"    "M (1931)"                    
## [3] "Bride of Frankenstein (1935)" "Braindead (1992)"            
## [5] "Bad Taste (1987)"            
## 
## $`3`
## [1] "Looking for Richard (1996)"                
## [2] "Stealing Beauty (1996)"                    
## [3] "Ghost in the Shell (Kokaku kidotai) (1995)"
## [4] "Big Lebowski, The (1998)"                  
## [5] "Braveheart (1995)"                         
## 
## $`4`
## [1] "Dragonheart (1996)"               "Breakfast at Tiffany's (1961)"   
## [3] "Brady Bunch Movie, The (1995)"    "Parent Trap, The (1961)"         
## [5] "James and the Giant Peach (1996)"
## 
## $`5`
## [1] "Nightmare on Elm Street, A (1984)"   
## [2] "Incognito (1997)"                    
## [3] "Star Trek IV: The Voyage Home (1986)"
## [4] "My Fair Lady (1964)"                 
## [5] "American President, The (1995)"

#check the accuracy 
results <- evaluate(scheme, method = "UBCF", type = "ratings")

## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.4sec]

plot(results, annotate=c(1,3), legend="bottomright")

avg(results)

##          RMSE      MSE       MAE
## [1,] 1.194439 1.426685 0.9389242

With a cold-ish start and centering, the RMSE is 1.19, MSE is 1.42, and MAE is .94.

Below, we use the 80/20 split above, with 10 visible items, to graph how many true/false positives there are for 1, 3, 5, and 10 recommendations. We’ll compare IBCF and UBCF to just recommending a random or popular item. Random = a baseline where you might as not well have a recommender system. Popular is a simple recommender system.

Interestingly, the popular items have the most true positives, followed by user-based. IBCF is performing worse than random. With 10 visible items, it may be best to only recommend popular items. This uses default params for k and nn (centering vs. Z-score normalization), which may put IBCF at a disadvantage: the documentation has k much higher at 100. I used this first, and the results for IBCF were much better.

algorithms <- list(`random items` = list(name = "RANDOM", param = NULL),
                   `popular items` = list(name = "POPULAR", param = NULL),
                   `user-based CF` = list(name = "UBCF", param = list(nn = 25)),
                   `item-based CF` = list(name = "IBCF", param = list(k = 30)))

results_alg <- evaluate(scheme, algorithms, type = "topNList",
                        n = c(1, 3, 5, 10, 30), progress = FALSE)

plot(results_alg, annotate = 2, legend = "topleft")

Let’s try using a z-score normalization method.

set.seed(1122)

#create recommender
rec2 <- Recommender(getData(scheme, "train"), method = "UBCF",
                    param = list(normalize = "Z-score"))

results_2 <- evaluate(scheme, method = "UBCF", type = "ratings", 
                      param = list(normalize = "Z-score"))

## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.41sec]

avg(results_2)

##          RMSE      MSE       MAE
## [1,] 1.196224 1.430952 0.9324367

The numbers are very similar to the first attempt, with centering winning on everything but MAE. This may mean Z-score creates more outliers.

Using a higher number of given ratings

Let’s increase the given number by a lot.

set.seed(1122)
scheme2 <- evaluationScheme(MovieLense40, 
                           method = "split", 
                           #percent of users who comprise the test set
                           train = 0.8, 
                           #number of ratings 
                           given = 35, 
                           goodRating = 4)

scheme2

## Evaluation scheme with 35 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.800
## Good ratings: >=4.000000
## Data set: 633 x 1664 rating matrix of class 'realRatingMatrix' with 90953 ratings.

#create the recommender
rec3 <- Recommender(getData(scheme2, "train"), method = "UBCF")

results <- evaluate(scheme2, method = "UBCF", type = "ratings")

## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.36sec]

avg(results)

##          RMSE     MSE       MAE
## [1,] 1.157752 1.34039 0.9025691

Metrics are all slightly better. With z-score:

#Z-score
set.seed(1122)

#create recommender
rec4 <- Recommender(getData(scheme2, "train"), method = "UBCF",
                    param = list(normalize = "Z-score"))

results_2 <- evaluate(scheme2, method = "UBCF", type = "ratings", 
                      param = list(normalize = "Z-score"))

## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.35sec]

avg(results_2)

##          RMSE      MSE       MAE
## [1,] 1.158569 1.342283 0.8946881

Models are performing similarly. Z-score does better on MAE, but slightly worse on RMSE and MSE. Again, this may be outlier-related.

Let’s graph to see if 30 visible items is appreciably better than (I’ve changed the given to 30 so every user has at least 10 items that can be tested):

set.seed(1122)
scheme_alg <- evaluationScheme(MovieLense40, 
                           method = "split", 
                           #percent of users who comprise the test set
                           train = 0.8, 
                           #number of ratings 
                           given = 30, 
                           goodRating = 4)

results_alg_2 <- evaluate(scheme_alg, algorithms, type = "topNList", n = c(1, 3, 5, 10), progress = FALSE)

plot(results_alg_2, annotate = 2, legend = "topleft")

User-based is still doing better than item-based, as is a random suggestion. Simply recommending a popular item is still doing better than everything.

Using K-fold cross-validation (no training/test split)

#10-fold cross-validation - cold-start (10 ratings)
set.seed(1122)
scheme3 <- evaluationScheme(MovieLense40, 
                           method = "cross-validation", 
                           k = 10, 
                           given = 10, 
                           goodRating = 4)

#create recommenders
rec5 <- Recommender(getData(scheme3, "train"), method = "UBCF")

rec6 <- Recommender(getData(scheme3, "train"), method = "UBCF",
                    param = list(normalize = "Z-score"))


#results

results_kfold <- evaluate(scheme3, method = "UBCF", type = "ratings")

## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.24sec] 
##   2  [0sec/0.22sec] 
##   3  [0sec/0.21sec] 
##   4  [0sec/0.24sec] 
##   5  [0.02sec/0.21sec] 
##   6  [0sec/0.25sec] 
##   7  [0sec/0.22sec] 
##   8  [0.02sec/0.2sec] 
##   9  [0.02sec/0.19sec] 
##   10  [0sec/0.42sec]

results__kfold_2 <- evaluate(scheme3, method = "UBCF", type = "ratings", 
                      param = list(normalize = "Z-score"))

## UBCF run fold/sample [model time/prediction time]
##   1  [0.02sec/0.19sec] 
##   2  [0.02sec/0.18sec] 
##   3  [0sec/0.2sec] 
##   4  [0sec/0.2sec] 
##   5  [0.02sec/0.21sec] 
##   6  [0.02sec/0.18sec] 
##   7  [0.02sec/0.2sec] 
##   8  [0.01sec/0.21sec] 
##   9  [0.02sec/0.25sec] 
##   10  [0.01sec/0.2sec]

avg(results_kfold)

##          RMSE      MSE       MAE
## [1,] 1.191329 1.419601 0.9355314

avg(results__kfold_2)

##          RMSE      MSE       MAE
## [1,] 1.197822 1.435525 0.9297829

#with 35 visible ratings
set.seed(1122)
scheme4 <- evaluationScheme(MovieLense40, 
                           method = "cross-validation", 
                           k = 10, 
                           given = 35, 
                           goodRating = 4)


#create recommenders
rec7 <- Recommender(getData(scheme4, "train"), method = "UBCF")

rec8 <- Recommender(getData(scheme4, "train"), method = "UBCF",
                    param = list(normalize = "Z-score"))

results_kfold <- evaluate(scheme4, method = "UBCF", type = "ratings")

## UBCF run fold/sample [model time/prediction time]
##   1  [0.02sec/0.21sec] 
##   2  [0.02sec/0.22sec] 
##   3  [0sec/0.23sec] 
##   4  [0sec/0.23sec] 
##   5  [0.01sec/0.19sec] 
##   6  [0sec/0.2sec] 
##   7  [0sec/0.22sec] 
##   8  [0sec/0.25sec] 
##   9  [0sec/0.21sec] 
##   10  [0sec/0.22sec]

results__kfold_2 <- evaluate(scheme4, method = "UBCF", type = "ratings", 
                      param = list(normalize = "Z-score"))

## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.23sec] 
##   2  [0.02sec/0.25sec] 
##   3  [0.01sec/0.22sec] 
##   4  [0sec/0.22sec] 
##   5  [0sec/0.22sec] 
##   6  [0sec/0.2sec] 
##   7  [0.02sec/0.23sec] 
##   8  [0sec/0.23sec] 
##   9  [0.01sec/0.22sec] 
##   10  [0.02sec/0.22sec]

avg(results_kfold)

##          RMSE      MSE      MAE
## [1,] 1.177255 1.386402 0.921188

avg(results__kfold_2)

##          RMSE      MSE       MAE
## [1,] 1.178346 1.389506 0.9142137

Both models are slightly worse with K-fold cross-validation, compared to a training/test split.

Trying with K = 5 (5-fold cross validation):

set.seed(1122)
scheme5 <- evaluationScheme(MovieLense40, 
                           method = "cross-validation", 
                           k = 5, 
                           given = 10, 
                           goodRating = 4)


rec9 <- Recommender(getData(scheme5, "train"), method = "UBCF")

rec10 <- Recommender(getData(scheme5, "train"), method = "UBCF",
                             parameter = list(normalize = "Z-score"))



results_kfold <- evaluate(scheme5, method = "UBCF", type = "ratings")

## UBCF run fold/sample [model time/prediction time]
##   1  [0.01sec/0.57sec] 
##   2  [0.02sec/0.36sec] 
##   3  [0sec/0.4sec] 
##   4  [0.02sec/0.36sec] 
##   5  [0sec/0.38sec]

results__kfold_2 <- evaluate(scheme5, method = "UBCF", type = "ratings", 
                      param = list(normalize = "Z-score"))

## UBCF run fold/sample [model time/prediction time]
##   1  [0.03sec/0.4sec] 
##   2  [0.02sec/0.36sec] 
##   3  [0sec/0.38sec] 
##   4  [0sec/0.39sec] 
##   5  [0.01sec/0.4sec]

avg(results_kfold)

##          RMSE      MSE       MAE
## [1,] 1.190219 1.416783 0.9346346

avg(results__kfold_2)

##          RMSE      MSE       MAE
## [1,] 1.196284 1.431477 0.9286537

K=5 comes with slightly worse metrics than 10-fold validation.

Let’s try the model on users with slightly more data:

MovieLense90 <- MovieLense[rowCounts(MovieLense) > 90, ]

#383 users left
MovieLense90

## 383 x 1664 rating matrix of class 'realRatingMatrix' with 75998 ratings.

I’ve also increased the neighborhood size here, to see if more neighbors improves the metrics.

set.seed(1122)
scheme_big <- evaluationScheme(MovieLense90, 
                           method = "split", 
                           #percent of users who comprise the test set
                           train = 0.8, 
                           #number of ratings 
                           given = 65, 
                           goodRating = 4)

scheme_big

## Evaluation scheme with 65 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.800
## Good ratings: >=4.000000
## Data set: 383 x 1664 rating matrix of class 'realRatingMatrix' with 75998 ratings.

Recommender(getData(scheme_big, "train"), method = "UBCF", param = list(nn = 3))

## Recommender of type 'UBCF' for 'realRatingMatrix' 
## learned using 306 users.

#neighborhood sizes of 25 (the default) and 50, then 80
results_big <- evaluate(scheme_big, method = "UBCF", type = "ratings",
                        param = list(nn = 25))

## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.23sec]

results_big_2 <- evaluate(scheme_big, method = "UBCF", type = "ratings",
                        param = list(nn = 50))

## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.22sec]

results_big_3 <- evaluate(scheme_big, method = "UBCF", type = "ratings",
                        param = list(nn = 80))

## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.25sec]

results_big_3z <- evaluate(scheme_big, method = "UBCF", type = "ratings",
                        param = list(nn = 80, normalize = "Z-score"))

## UBCF run fold/sample [model time/prediction time]
##   1  [0.02sec/0.24sec]

avg(results_big)

##          RMSE     MSE       MAE
## [1,] 1.061452 1.12668 0.8321021

avg(results_big_2)

##          RMSE      MSE       MAE
## [1,] 1.001691 1.003386 0.7878649

#Z-score and centering are very close
avg(results_big_3)

##           RMSE       MSE       MAE
## [1,] 0.9838545 0.9679696 0.7723598

avg(results_big_3z)

##           RMSE       MSE       MAE
## [1,] 0.9860828 0.9723594 0.7701571

The larger number of ratings made for a much lower RMSE and MAE. Using a large k for knn brings it down even more.

Let’s see if one system performs better as it gives more ratings (a min of 90 recommendations means being able to test more recommendations):

set.seed(1122)
results_alg_3 <- evaluate(scheme_big, algorithms, type = "topNList", 
                          n = c(1, 3, 5, 10, 20, 40, 60),progress = FALSE)

plot(results_alg_3, annotate = 2, legend = "topleft")

This looks pretty similar to previous iterations, with IBCF performing worse than recommending random items.

Finding the optimal value for knn (neighborhood size) with UBCF

#if I run this twice, this removes the object so it doesn't append
rm(rmse_values)

## Warning in rm(rmse_values): object 'rmse_values' not found

k_values <- c(3, 5, 10, 20, 30, 40, 50)
rmse_values <- c()

set.seed(1122)
scheme_cv <- evaluationScheme(MovieLense40,
                              method     = "cross-validation",
                              k          = 10,
                              given      = 35,
                              goodRating = 4)

for (k in k_values) {
  results_k <- evaluate(scheme_cv, method = "UBCF", type = "ratings",
                        param = list(normalize = "center", nn = k))
 rmse_values <- c(rmse_values, avg(results_k)[1, "RMSE"])
}

## UBCF run fold/sample [model time/prediction time]
##   1  [0.02sec/0.21sec] 
##   2  [0sec/0.23sec] 
##   3  [0.01sec/0.19sec] 
##   4  [0sec/0.21sec] 
##   5  [0sec/0.19sec] 
##   6  [0sec/0.2sec] 
##   7  [0sec/0.18sec] 
##   8  [0sec/0.19sec] 
##   9  [0sec/0.2sec] 
##   10  [0.02sec/0.17sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.2sec] 
##   2  [0.02sec/0.18sec] 
##   3  [0sec/0.2sec] 
##   4  [0.02sec/0.18sec] 
##   5  [0sec/0.2sec] 
##   6  [0sec/0.23sec] 
##   7  [0sec/0.21sec] 
##   8  [0.01sec/0.36sec] 
##   9  [0sec/0.18sec] 
##   10  [0sec/0.21sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.2sec] 
##   2  [0sec/0.2sec] 
##   3  [0sec/0.2sec] 
##   4  [0sec/0.21sec] 
##   5  [0sec/0.21sec] 
##   6  [0sec/0.2sec] 
##   7  [0sec/0.2sec] 
##   8  [0sec/0.36sec] 
##   9  [0sec/0.24sec] 
##   10  [0.02sec/0.19sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.2sec] 
##   2  [0sec/0.21sec] 
##   3  [0.02sec/0.2sec] 
##   4  [0sec/0.22sec] 
##   5  [0.01sec/0.21sec] 
##   6  [0.01sec/0.2sec] 
##   7  [0sec/0.22sec] 
##   8  [0.02sec/0.2sec] 
##   9  [0sec/0.24sec] 
##   10  [0sec/0.19sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.22sec] 
##   2  [0sec/0.39sec] 
##   3  [0.02sec/0.22sec] 
##   4  [0sec/0.22sec] 
##   5  [0sec/0.21sec] 
##   6  [0.02sec/0.2sec] 
##   7  [0sec/0.22sec] 
##   8  [0sec/0.21sec] 
##   9  [0sec/0.22sec] 
##   10  [0sec/0.22sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0.02sec/0.21sec] 
##   2  [0.01sec/0.24sec] 
##   3  [0sec/0.22sec] 
##   4  [0sec/0.23sec] 
##   5  [0sec/0.23sec] 
##   6  [0sec/0.24sec] 
##   7  [0sec/0.22sec] 
##   8  [0sec/0.24sec] 
##   9  [0sec/0.23sec] 
##   10  [0sec/0.22sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.27sec] 
##   2  [0.01sec/0.24sec] 
##   3  [0.01sec/0.22sec] 
##   4  [0sec/0.23sec] 
##   5  [0sec/0.26sec] 
##   6  [0.02sec/0.22sec] 
##   7  [0sec/0.41sec] 
##   8  [0sec/0.23sec] 
##   9  [0sec/0.26sec] 
##   10  [0sec/0.24sec]

print(rmse_values)

##     RMSE     RMSE     RMSE     RMSE     RMSE     RMSE     RMSE 
## 1.348121 1.314264 1.268541 1.204103 1.157759 1.121536 1.095474

#plot the results
plot(k_values, rmse_values, type = "b", 
     xlab = "k (neighborhood size)", ylab = "RMSE",
     main = "RMSE vs neighborhood size")

As K goes up, RMSE goes down. The optimal neighborhood size could be much larger than 50.

rm(rmse_values)

k_values <- c(40, 50, 60, 70, 80, 120, 200, 300)
rmse_values <- c()

set.seed(1122)
scheme_cv <- evaluationScheme(MovieLense40,
                              method     = "cross-validation",
                              k          = 10,
                              given      = 35,
                              goodRating = 4)

for (k in k_values) {
  results_k <- evaluate(scheme_cv, method = "UBCF", type = "ratings",
                        param = list(normalize = "center", nn = k))
 rmse_values <- c(rmse_values, avg(results_k)[1, "RMSE"])
}

## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.24sec] 
##   2  [0sec/0.22sec] 
##   3  [0.02sec/0.26sec] 
##   4  [0sec/0.24sec] 
##   5  [0.02sec/0.22sec] 
##   6  [0sec/0.22sec] 
##   7  [0sec/0.24sec] 
##   8  [0sec/0.23sec] 
##   9  [0sec/0.23sec] 
##   10  [0sec/0.21sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.23sec] 
##   2  [0sec/0.23sec] 
##   3  [0sec/0.22sec] 
##   4  [0sec/0.24sec] 
##   5  [0sec/0.24sec] 
##   6  [0sec/0.24sec] 
##   7  [0sec/0.23sec] 
##   8  [0sec/0.43sec] 
##   9  [0sec/0.26sec] 
##   10  [0sec/0.24sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0.01sec/0.24sec] 
##   2  [0.01sec/0.22sec] 
##   3  [0sec/0.26sec] 
##   4  [0sec/0.25sec] 
##   5  [0sec/0.22sec] 
##   6  [0sec/0.25sec] 
##   7  [0sec/0.27sec] 
##   8  [0sec/0.25sec] 
##   9  [0sec/0.25sec] 
##   10  [0sec/0.25sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.25sec] 
##   2  [0sec/0.25sec] 
##   3  [0.01sec/0.24sec] 
##   4  [0sec/0.26sec] 
##   5  [0sec/0.24sec] 
##   6  [0sec/0.25sec] 
##   7  [0sec/0.25sec] 
##   8  [0sec/0.26sec] 
##   9  [0sec/0.25sec] 
##   10  [0.02sec/0.23sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.25sec] 
##   2  [0sec/0.28sec] 
##   3  [0sec/0.25sec] 
##   4  [0.02sec/0.25sec] 
##   5  [0sec/0.27sec] 
##   6  [0sec/0.25sec] 
##   7  [0sec/0.23sec] 
##   8  [0sec/0.25sec] 
##   9  [0sec/0.25sec] 
##   10  [0sec/0.31sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.3sec] 
##   2  [0.01sec/0.28sec] 
##   3  [0sec/0.28sec] 
##   4  [0sec/0.48sec] 
##   5  [0sec/0.28sec] 
##   6  [0sec/0.3sec] 
##   7  [0sec/0.3sec] 
##   8  [0sec/0.29sec] 
##   9  [0sec/0.31sec] 
##   10  [0.01sec/0.31sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.34sec] 
##   2  [0sec/0.33sec] 
##   3  [0.02sec/0.41sec] 
##   4  [0sec/0.35sec] 
##   5  [0sec/0.33sec] 
##   6  [0sec/0.38sec] 
##   7  [0sec/0.33sec] 
##   8  [0sec/0.36sec] 
##   9  [0.02sec/0.31sec] 
##   10  [0sec/0.36sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0.02sec/0.45sec] 
##   2  [0sec/0.59sec] 
##   3  [0sec/0.45sec] 
##   4  [0sec/0.4sec] 
##   5  [0sec/0.44sec] 
##   6  [0sec/0.43sec] 
##   7  [0.02sec/0.39sec] 
##   8  [0sec/0.43sec] 
##   9  [0sec/0.42sec] 
##   10  [0.02sec/0.44sec]

print(rmse_values)

##      RMSE      RMSE      RMSE      RMSE      RMSE      RMSE      RMSE      RMSE 
## 1.1215357 1.0954740 1.0687032 1.0483898 1.0310267 0.9896475 0.9643476 0.9598891

#plot the results
plot(k_values, rmse_values, type = "b", 
     xlab = "k (neighborhood size)", ylab = "RMSE",
     main = "RMSE vs neighborhood size")

RMSE keeps going down with a larger neighborhood size, but the reduction becomes less dramatic after about 60. Large neighborhood sizes also come with the risk of underfitting, and they are computationally expensive.

Item-based collaborative filtering

Here, I’ll test the same conditions with an IBCF system. We’ll take a look at the RMSE, MSE, and MAE for the same models.

This algorithm focuses on the similarities between items. It recommends items similar to ones a user has liked, based on other users’ behavior. It finds movies similar to the ones a user rated highly. My methodology is the same as for UBCF. Overall, with the default settings, IBCF performs worse than UBCF for this data set. Toward the end, I adjust neighborhood size to account for this. I’ve also included a couple of precision-recall curves to look at the effects of larger neighborhood sizes.

Using the initial train/test split and 10 visible items (scheme):

rec11 <- Recommender(getData(scheme, "train"), method = "IBCF")

results_IBCF_centered <- evaluate(scheme, method = "IBCF", type = "ratings")

## IBCF run fold/sample [model time/prediction time]
##   1  [2.12sec/0.02sec]

avg(results_IBCF_centered)

##         RMSE     MSE      MAE
## [1,] 1.49684 2.24053 1.189394

results_IBCF_z <- evaluate(scheme, method = "IBCF", type = "ratings",
                             parameter = list(normalize = "Z-score"))

## IBCF run fold/sample [model time/prediction time]
##   1  [2.12sec/0sec]

avg(results_IBCF_centered)

##         RMSE     MSE      MAE
## [1,] 1.49684 2.24053 1.189394

avg(results_IBCF_z)

##          RMSE     MSE      MAE
## [1,] 1.500943 2.25283 1.192453

Centering preforms slightly better overall. The RMSE, MAE, and MSE are all much worse than the same UBCF model.

With 35 ratings given to the recommender, all the metrics get a little better. They are still much worse than UBCF for the same scheme.

results_IBCF_centered2 <- evaluate(scheme2, method = "IBCF", type = "ratings")

## IBCF run fold/sample [model time/prediction time]
##   1  [2.16sec/0sec]

results_IBCF_z2 <- evaluate(scheme2, method = "IBCF", type = "ratings",
                             parameter = list(normalize = "Z-score"))

## IBCF run fold/sample [model time/prediction time]
##   1  [2.11sec/0.01sec]

avg(results_IBCF_centered2)

##          RMSE      MSE      MAE
## [1,] 1.486972 2.211084 1.138925

avg(results_IBCF_z2)

##          RMSE      MSE      MAE
## [1,] 1.485236 2.205926 1.136941

With k-fold cross-validation and a given of 10 (again, these metrics are all much worse than UBCF for the same scheme):

results_kfold_IBCF <- evaluate(scheme3, method = "IBCF", type = "ratings")

## IBCF run fold/sample [model time/prediction time]
##   1  [2.42sec/0sec] 
##   2  [2.15sec/0sec] 
##   3  [2.15sec/0.01sec] 
##   4  [2.19sec/0sec] 
##   5  [2.2sec/0.01sec] 
##   6  [2.32sec/0sec] 
##   7  [2.37sec/0sec] 
##   8  [2.19sec/0sec] 
##   9  [2.32sec/0.01sec] 
##   10  [2.19sec/0sec]

results__kfold_IBCF_z <- evaluate(scheme3, method = "IBCF", type = "ratings", 
                      param = list(normalize = "Z-score"))

## IBCF run fold/sample [model time/prediction time]
##   1  [2.18sec/0sec] 
##   2  [2.15sec/0sec] 
##   3  [2.17sec/0.01sec] 
##   4  [2.18sec/0sec] 
##   5  [2.22sec/0sec] 
##   6  [2.18sec/0.01sec] 
##   7  [2.32sec/0sec] 
##   8  [2.32sec/0.02sec] 
##   9  [2.2sec/0.02sec] 
##   10  [2.1sec/0.01sec]

avg(results_kfold_IBCF)

##          RMSE      MSE      MAE
## [1,] 1.362398 1.901423 1.042297

avg(results__kfold_IBCF_z)

##          RMSE     MSE      MAE
## [1,] 1.359943 1.89274 1.039222

Z-score normalization is outperforming centering here. It’s much worse than the IBCF model with the same parameters.

With a larger given of 35:

results_kfold_ibcf_2 <- evaluate(scheme4, method = "IBCF", type = "ratings")

## IBCF run fold/sample [model time/prediction time]
##   1  [2.17sec/0sec] 
##   2  [2.17sec/0sec] 
##   3  [2.13sec/0.01sec] 
##   4  [2.16sec/0.02sec] 
##   5  [2.32sec/0sec] 
##   6  [2.2sec/0.02sec] 
##   7  [2.25sec/0sec] 
##   8  [2.13sec/0.02sec] 
##   9  [2.18sec/0.01sec] 
##   10  [2.18sec/0sec]

results__kfold_ibcf_z2 <- evaluate(scheme4, method = "IBCF", type = "ratings", 
                      param = list(normalize = "Z-score"))

## IBCF run fold/sample [model time/prediction time]
##   1  [2.5sec/0sec] 
##   2  [2.24sec/0sec] 
##   3  [2.31sec/0.02sec] 
##   4  [2.31sec/0sec] 
##   5  [2.2sec/0sec] 
##   6  [2.37sec/0.02sec] 
##   7  [2.23sec/0.02sec] 
##   8  [2.44sec/0sec] 
##   9  [2.15sec/0.01sec] 
##   10  [2.29sec/0sec]

avg(results_kfold_ibcf_2)

##          RMSE      MSE      MAE
## [1,] 1.436275 2.089153 1.062142

avg(results__kfold_ibcf_z2)

##          RMSE      MSE      MAE
## [1,] 1.437901 2.093264 1.064343

Somehow, with a higher given, these are worse the cold-start model for cross-validation. I will skip scheme 5, which just had a different number of cross-validations.

Increasing k

It’s clear that this model needs a much higher k than the default 30. We’ll use scheme 2 again (35 visible ratings) and increase the k. These are all centered.

results_k_100 <- evaluate(scheme2, method = "IBCF", type = "ratings",
                        param = list(k = 100))

## IBCF run fold/sample [model time/prediction time]
##   1  [1.97sec/0.03sec]

results_k_200 <- evaluate(scheme2, method = "IBCF", type = "ratings",
                        param = list(k = 200))

## IBCF run fold/sample [model time/prediction time]
##   1  [2.02sec/0.04sec]

results_k_300 <- evaluate(scheme2, method = "IBCF", type = "ratings",
                        param = list(k = 300))

## IBCF run fold/sample [model time/prediction time]
##   1  [2sec/0.05sec]

results_k_500 <- evaluate(scheme2, method = "IBCF", type = "ratings",
                        param = list(k = 500))

## IBCF run fold/sample [model time/prediction time]
##   1  [2.05sec/0.08sec]

#check the results
avg(results_k_100)

##          RMSE      MSE      MAE
## [1,] 1.408638 1.984261 1.072098

avg(results_k_200)

##          RMSE      MSE     MAE
## [1,] 1.211184 1.466967 0.91295

avg(results_k_300)

##          RMSE      MSE       MAE
## [1,] 1.090548 1.189296 0.8385303

avg(results_k_500)

##           RMSE       MSE       MAE
## [1,] 0.9962403 0.9924948 0.7810785

The numbers improve with a larger k.

This uses the data set with users who have rated more than 90 movies, and these are all centered:

results_big_100 <- evaluate(scheme_big, method = "IBCF", type = "ratings",
                        param = list(k = 100))

## IBCF run fold/sample [model time/prediction time]
##   1  [1.58sec/0.02sec]

results_big_500 <- evaluate(scheme_big, method = "IBCF", type = "ratings",
                        param = list(k = 500))

## IBCF run fold/sample [model time/prediction time]
##   1  [1.42sec/0.03sec]

results_big_600 <- evaluate(scheme_big, method = "IBCF", type = "ratings",
                        param = list(k = 600))

## IBCF run fold/sample [model time/prediction time]
##   1  [1.43sec/0.05sec]

#things start to get worse around here
results_big_700 <- evaluate(scheme_big, method = "IBCF", type = "ratings",
                        param = list(k = 700))

## IBCF run fold/sample [model time/prediction time]
##   1  [1.48sec/0.05sec]

results_big_1000 <- evaluate(scheme_big, method = "IBCF", type = "ratings",
                        param = list(k = 1000))

## IBCF run fold/sample [model time/prediction time]
##   1  [1.61sec/0.05sec]

#this is close to the largest size becuase the database only has ~1600 movies
results_big_1600 <- evaluate(scheme_big, method = "IBCF", type = "ratings",
                        param = list(k = 1600))

## IBCF run fold/sample [model time/prediction time]
##   1  [1.49sec/0.06sec]

#check the results
avg(results_big_100)

##          RMSE      MSE     MAE
## [1,] 1.515679 2.297283 1.14449

avg(results_big_500)

##           RMSE       MSE       MAE
## [1,] 0.9864161 0.9730167 0.7836975

avg(results_big_600)

##           RMSE       MSE       MAE
## [1,] 0.9797545 0.9599188 0.7793573

avg(results_big_700)

##           RMSE       MSE       MAE
## [1,] 0.9809035 0.9621716 0.7803528

avg(results_big_1000)

##           RMSE       MSE       MAE
## [1,] 0.9985495 0.9971011 0.7967567

avg(results_big_1600)

##          RMSE      MSE       MAE
## [1,] 1.011768 1.023674 0.8081521

The metrics actually get worse after about 600, but at their best, they get close to UBCF results with a higher nn value.

Looking at true positives and negatives one more time with adjusted neighborhood sizes

The RMSE, MSE, and MAE all go down as the neighborhood sizes go up. This is not true for TPs and TNs because these are measuring different things. The RMSE measures how accurate the rating is, TP/FP measures whether a user watched/rated something in the model’s top n items.. With such a large data set, it’s easier to get close to the user’s rating than it is to pick something they liked and watched. They have the highest chance of having rated a popular movie, which is why the popular algorithm wins every time.

I played with these numbers a little bit, and k = 300 looks better for this data set, helping IBCF edge out random for the first time. Bringing nn to 2 or 3 gives UBCF a higher success rate than IBCF, but that’s a little suspicious and might be a result of overfitting. I’ve left it at 20 here because it’s different from the default, but still outperforms random.

algorithms2 <- list(`random items` = list(name = "RANDOM", param = NULL),
                   `popular items` = list(name = "POPULAR", param = NULL),
                   `user-based CF` = list(name = "UBCF", param = list(nn = 20)),
                   `item-based CF` = list(name = "IBCF", param = list(k = 300)))

results_alg_4 <- evaluate(scheme_big, algorithms2, type = "topNList",
                        n = c(1, 3, 5, 10, 20, 40, 60, 80, 100), progress = FALSE)

plot(results_alg_4, annotate = 2, legend = "topleft")

And if we go very high with nn and k, the model is oversmoothing/underfitting and everyone’s preferences start to look the same. UBCF does start to look better with more recommendations.

algorithms3 <- list(`random items` = list(name = "RANDOM", param = NULL),
                   `popular items` = list(name = "POPULAR", param = NULL),
                   `user-based CF` = list(name = "UBCF", param = list(nn = 300)),
                   `item-based CF` = list(name = "IBCF", param = list(k = 1600)))

results_alg_5 <- evaluate(scheme_big, algorithms3, type = "topNList",
                        n = c(1, 3, 5, 10, 20, 40, 60, 80, 100), progress = FALSE)

plot(results_alg_5, annotate = 2, legend = "topleft")

Summary and recommendations

There are a lot of different variables at play: the number of ratings the model has access to, knn, how to center, whether to use k-fold cross-validation or use a training/test set. In general, it does seem that k/nn should be much higher than the default values, as these reduce the MSE, MAE, and RMSE.

Sometimes Z-score normalization works better, most of the time regular centering does. The metrics were largely comparable for the same model.

With large enough neighborhood sizes, the RMSE, MAE, and MSE just about evened out for both models (metrics started getting worse for IBCF after about 600, and just kept getting better with diminishing returns for UBCF). In general, a larger neighborhood size meant better metrics, but this runs the risk of overfitting, rating everything the average, or just recommending popular itmes. It’s a less customized algorithm, and increasing k is computationally expensive.

Which model?

When looking at the TP vs. FP precision-recall curves, UBCF tends to work better with smaller neighborhood sizes and access to fewer rated items, so it’s likely to perform better with a cold start system. Because of this, I recommend a UBCF model with neighborhood size of about 60 (the point around which we saw diminishing returns) and a given of 35, which can accommodate most users. The data set where everyone had to have at least 90 ratings was much smaller.

In general, breaking the data into a train/test split gave better results. K-fold cross-validation also takes longer and is more computationally expensive, so I recommend using a train/test split. While centering is a more understandable method of normalization, Z-score gives slightly better metrics for this model.

The recommended model:

set.seed(1122)
scheme_final <- evaluationScheme(MovieLense40, 
                           method = "split", 
                           #percent of users who comprise the test set
                           train = 0.8, 
                           #number of ratings 
                           given = 35, 
                           goodRating = 4)


#check the accuracy 
results_final <- evaluate(scheme_final, method = "UBCF", type = "ratings",
                          param = list(nn = 60))

## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.49sec]

results_final_z <- evaluate(scheme_final, method = "UBCF", type = "ratings",
                          param = list(nn = 60, normalize = "Z-score"))

## UBCF run fold/sample [model time/prediction time]
##   1  [0.02sec/0.44sec]

avg(results_final)

##          RMSE      MSE       MAE
## [1,] 1.055257 1.113567 0.8245945

#the recommended recommender model
avg(results_final_z)

##          RMSE      MSE       MAE
## [1,] 1.053888 1.110679 0.8186928

Recommending popular items

The precisiom-recall curves also show that, for a system without a lot of data, sometimes it makes sense to just recommend popular items, as unpersonalized as that is. In the graph of different models vs. random and popular, popular won by a lot every time. It just doesn’t help with discoverability.

Sources

Joe Domaleski. UBCF vs. IBCF: Comparing Marketing Recommendation System Algorithms in R. April 2025. https://blog.marketingdatascience.ai/ubcf-vs-ibcf-comparing-marketing-recommendation-system-algorithms-in-r-38ff36bf05d3

Michael Hahlser, recommenderlab: An R Framework for Developing and Testing Recommendation Algorithms. May 2022. https://arxiv.org/abs/2205.12371

Claude Sonnet, 4.6. [Large language model] Accessed June 2026. Claude.ai (to make the neighborhood size vs RMSE graph and to figure out that this package uses three separate variables that can all be referred to as k)

Project 2

Sam Barbaro

2026-06-07

Project 2

Exploratory analysis

Most-rated and top-rated movies

Number of ratings by user

Evaluating different recommendations systems

Minimum number of ratings

Normalizing the ratings

User-based collaborative filtering

Note on methodology and organization

Using a higher number of given ratings

Using K-fold cross-validation (no training/test split)

Finding the optimal value for knn (neighborhood size) with UBCF

Item-based collaborative filtering

Increasing k

Looking at true positives and negatives one more time with adjusted neighborhood sizes

Summary and recommendations

Recommending popular items

Sources