DATA 612 Project 2

Author

Semyon Toybis

Project 2

Project 2 requires implementing two of the following models on a data set of choice: content-based filtering, user-user collaborative filtering, or item-item collaborative filtering.

I will work with the MovieLens small data set, which is saved in my GitHub repository.

Below I load the necessary libraries and import the data sets.

library(tidyverse)
library(recommenderlab)
library(superml)
movies <- read_csv('https://raw.githubusercontent.com/stoybis/DATA612/refs/heads/main/Project2/movies.csv')

ratings <- read_csv('https://raw.githubusercontent.com/stoybis/DATA612/refs/heads/main/Project2/ratings.csv')

Exploratory Analysis

head(movies)
# A tibble: 6 × 3
  movieId title                              genres                             
    <dbl> <chr>                              <chr>                              
1       1 Toy Story (1995)                   Adventure|Animation|Children|Comed…
2       2 Jumanji (1995)                     Adventure|Children|Fantasy         
3       3 Grumpier Old Men (1995)            Comedy|Romance                     
4       4 Waiting to Exhale (1995)           Comedy|Drama|Romance               
5       5 Father of the Bride Part II (1995) Comedy                             
6       6 Heat (1995)                        Action|Crime|Thriller              
head(ratings)
# A tibble: 6 × 4
  userId movieId rating timestamp
   <dbl>   <dbl>  <dbl>     <dbl>
1      1       1      4 964982703
2      1       3      4 964981247
3      1       6      4 964982224
4      1      47      5 964983815
5      1      50      5 964982931
6      1      70      3 964982400
length(unique(movies$movieId))
[1] 9742
length(unique(ratings$userId))
[1] 610
length(unique(ratings$movieId))
[1] 9724

There are more movies in the movies file than there are movies with ratings, so I subset the movies data to those that were rated.

movies <- movies |> filter(movieId %in% ratings$movieId)

Below is the distribution of ratings:

ratings |> ggplot(aes(x = '', y= rating)) + geom_boxplot() +
 stat_summary(fun = mean, geom = 'point', shape = 4, size = 5, color = 'red') + ggtitle('Distribution of Movie Ratings') + theme_minimal()

Below is the distribution of the number of movies that each user has rated:

ratings |> group_by(userId) |> summarise(num_ratings = n_distinct(movieId)) |>
  ggplot(aes(x = num_ratings, y ='')) + geom_boxplot() +
   stat_summary(fun = mean, geom = 'point', shape = 4, size = 5, color = 'red') + ggtitle('Distribution of Ratings per User') + theme_minimal()

The average movie rating is approximately 3.5. The number of ratings per user has a long right tail: the median user has about 70 ratings, while the mean is roughly 165:

ratings |> group_by(userId) |> summarise(num_ratings = n()) |> select(num_ratings) |> summary()
  num_ratings    
 Min.   :  20.0  
 1st Qu.:  35.0  
 Median :  70.5  
 Mean   : 165.3  
 3rd Qu.: 168.0  
 Max.   :2698.0  

Below is the distribution of the number of ratings per movie:

ratings |> group_by(movieId) |> summarise(num_ratings = n()) |>
  ggplot(aes(x = num_ratings, y ='')) + geom_boxplot() +
   stat_summary(fun = mean, geom = 'point', shape = 4, size = 5, color = 'red') + ggtitle('Distribution of Ratings per Movie') + theme_minimal()

ratings |> group_by(movieId) |> summarise(num_ratings = n()) |> select(num_ratings) |> summary()
  num_ratings    
 Min.   :  1.00  
 1st Qu.:  1.00  
 Median :  3.00  
 Mean   : 10.37  
 3rd Qu.:  9.00  
 Max.   :329.00  

Data Prep

First, I transform the ratings data into a matrix that can be used with recommenderlab functions

ratings_matrix <- ratings |> select(-timestamp) |>
  pivot_wider(names_from = movieId, values_from = rating)

ratings_matrix <- as.matrix(ratings_matrix[,-1])
ratings_matrix <- as(ratings_matrix, 'realRatingMatrix')
ratings_matrix
610 x 9724 rating matrix of class 'realRatingMatrix' with 100836 ratings.

I now have a ratings matrix where each row is a user, each column is a movie, and each user's ratings populate the corresponding cells. This is a sparse matrix (most values are NA), and only the non-NA values are stored explicitly for efficient handling. Below is a snippet of the matrix:

getRatingMatrix(ratings_matrix[1:7,1:7])
7 x 7 sparse Matrix of class "dgCMatrix"
       1 3 6 47  50 70 101
[1,] 4.0 4 4  5 5.0  3   5
[2,] .   . .  . .    .   .
[3,] .   . .  . .    .   .
[4,] .   . .  2 .    .   .
[5,] 4.0 . .  . 4.0  .   .
[6,] .   5 4  4 1.0  .   .
[7,] 4.5 . .  . 4.5  .   .

Next, I create an item matrix that describes each item (movie) by its features (genres). A feature takes the value 1 if the genre applies to the movie and 0 otherwise. I use the CountVectorizer class from the superml package; as shown in the output below, its tokenizer splits hyphenated genre names into separate columns.

cfv <- CountVectorizer$new()

item_feature_matrix <- cfv$fit_transform(movies$genres)
item_feature_matrix <- as(item_feature_matrix, 'binaryRatingMatrix')
item_feature_matrix
9724 x 23 rating matrix of class 'binaryRatingMatrix' with 23145 ratings.
head(as(item_feature_matrix, 'matrix'))
  drama comedy thriller action romance adventure crime   sci    fi horror
1 FALSE   TRUE    FALSE  FALSE   FALSE      TRUE FALSE FALSE FALSE  FALSE
2 FALSE  FALSE    FALSE  FALSE   FALSE      TRUE FALSE FALSE FALSE  FALSE
3 FALSE   TRUE    FALSE  FALSE    TRUE     FALSE FALSE FALSE FALSE  FALSE
4  TRUE   TRUE    FALSE  FALSE    TRUE     FALSE FALSE FALSE FALSE  FALSE
5 FALSE   TRUE    FALSE  FALSE   FALSE     FALSE FALSE FALSE FALSE  FALSE
6 FALSE  FALSE     TRUE   TRUE   FALSE     FALSE  TRUE FALSE FALSE  FALSE
  fantasy children animation mystery documentary   war musical western  imax
1    TRUE     TRUE      TRUE   FALSE       FALSE FALSE   FALSE   FALSE FALSE
2    TRUE     TRUE     FALSE   FALSE       FALSE FALSE   FALSE   FALSE FALSE
3   FALSE    FALSE     FALSE   FALSE       FALSE FALSE   FALSE   FALSE FALSE
4   FALSE    FALSE     FALSE   FALSE       FALSE FALSE   FALSE   FALSE FALSE
5   FALSE    FALSE     FALSE   FALSE       FALSE FALSE   FALSE   FALSE FALSE
6   FALSE    FALSE     FALSE   FALSE       FALSE FALSE   FALSE   FALSE FALSE
   noir  film listed genres
1 FALSE FALSE  FALSE  FALSE
2 FALSE FALSE  FALSE  FALSE
3 FALSE FALSE  FALSE  FALSE
4 FALSE FALSE  FALSE  FALSE
5 FALSE FALSE  FALSE  FALSE
6 FALSE FALSE  FALSE  FALSE
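
Note that the vectorizer's tokenization splits the multi-word genres: Sci-Fi becomes the columns sci and fi, Film-Noir becomes film and noir, and (no genres listed) becomes genres and listed. Below is a hedged sketch (not used in the models that follow) of an alternative construction with tidyr that keeps each genre name intact; the object and column names here are illustrative.

# Alternative item-feature matrix that preserves full genre names such as Sci-Fi
genre_features <- movies |>
  select(movieId, genres) |>
  separate_rows(genres, sep = '\\|') |>
  mutate(value = 1) |>
  pivot_wider(names_from = genres, values_from = value, values_fill = 0)

# Drop movieId and coerce to a binaryRatingMatrix, as was done above
genre_features_brm <- as(as.matrix(genre_features[,-1]), 'binaryRatingMatrix')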

Modeling

I will compare item-based and user-based collaborative filtering and examine which algorithm produces the lowest error between actual and predicted ratings. Both models will be compared to the “RANDOM” model in the recommenderlab package, which generates random recommendations; this baseline provides a point of comparison for determining whether the models do better than guessing.
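
Error will be measured with recommenderlab’s calcPredictionAccuracy, which reports RMSE, MSE, and MAE for the held-out ratings. For reference, below is a minimal sketch of what those metrics compute; the vectors actual and predicted are illustrative, not objects created in this project.

# Illustrative metric definitions over held-out ratings; NAs
# (entries the model could not predict) are dropped before averaging
mse  <- function(actual, predicted) mean((actual - predicted)^2, na.rm = TRUE)
rmse <- function(actual, predicted) sqrt(mse(actual, predicted))
mae  <- function(actual, predicted) mean(abs(actual - predicted), na.rm = TRUE)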

User-based collaborative filtering (UBCF) groups users into neighborhoods based on similarity of tastes: users who rate items similarly are assumed to have similar tastes, so a user can be recommended an item that other users in the neighborhood enjoyed. Similarity can be defined via different metrics, such as cosine similarity (the default), Pearson correlation, or Jaccard similarity, and is calculated on the users’ row vectors of the ratings matrix, using the items that both users in a comparison have rated. The neighborhood of similar users is formed either by taking the k nearest neighbors in that vector space or by applying a user-specified similarity threshold. The neighborhood can then be used to recommend items that a user has not interacted with but that other users in the neighborhood enjoyed; additionally, the neighborhood’s ratings for an item (e.g. the average rating across neighbors) can be used to predict the rating the user would give to an item they have not rated.
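
As a concrete illustration of the user-based idea, below is a minimal sketch on a toy matrix (not the project data). It computes cosine similarity between the active user and every other user over co-rated items and predicts an unseen rating as a similarity-weighted average of the other users’ ratings; in practice the ratings would first be centered and only the k nearest neighbors would be used, steps recommenderlab handles internally.

# Toy matrix: 4 users (rows) x 4 items (columns); NA = not rated
toy <- rbind(u1 = c(5, 4, NA, 1),
             u2 = c(4, 5,  3, 1),
             u3 = c(1, 1,  4, 5),
             u4 = c(5, 5,  4, NA))

# Cosine similarity using only the items both users have rated
cosine_sim <- function(a, b) {
  keep <- !is.na(a) & !is.na(b)
  sum(a[keep] * b[keep]) / (sqrt(sum(a[keep]^2)) * sqrt(sum(b[keep]^2)))
}

# Similarity of user u1 to every other user
sims <- sapply(c('u2', 'u3', 'u4'), function(u) cosine_sim(toy['u1', ], toy[u, ]))

# Predicted rating for u1 on item 3: similarity-weighted average of the others' ratings
sum(sims * toy[c('u2', 'u3', 'u4'), 3]) / sum(sims)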

Item-based collaborative filtering (IBCF) evaluates similarity between items and recommends items that are similar to items a user previously enjoyed. This is done by building an item-to-item similarity matrix using one of the aforementioned similarity measures: instead of comparing rows of users (as in UBCF), columns of items are compared. The predicted rating for an item a user has not rated is the similarity-weighted average of the user’s ratings on similar items.
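
recommenderlab exposes the underlying similarity computation through its similarity() function, where which = 'items' compares columns instead of rows. Below is a small sketch on the first five movies of the ratings matrix; the 1:5 subset is only to keep the output readable.

# Item-to-item cosine similarities for the first five movies
similarity(ratings_matrix[, 1:5], method = 'cosine', which = 'items')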

In addition to comparing these algorithms, I will also compare performance when the number of neighbors used to form groups of similar users/items is changed. The recommenderlab defaults are 25 neighbors for UBCF and 30 for IBCF; I will compare these to a value of 50 (a broader neighborhood). Furthermore, I will also train these models with similarity measured by Pearson correlation to see how it compares to the default cosine similarity.

Evaluation Scheme

First, I normalize the ratings matrix. This removes user bias by subtracting each user’s (row) mean rating from that user’s ratings.

ratings_matrix_centered <- normalize(ratings_matrix, 'center')
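
As a quick check (a sketch, not part of the modeling pipeline), the centered ratings for the first user should equal that user’s raw ratings minus their mean:

# First user's raw ratings, their mean, and the centered ratings
head(getRatings(ratings_matrix[1, ]))
mean(getRatings(ratings_matrix[1, ]))
head(getRatings(ratings_matrix_centered[1, ]))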

Next, I create an evaluation scheme. This is a recommenderlab object that stores instructions on how to split the data into training and test sets and how to evaluate results. Below I split the normalized ratings matrix via an 80/20 split. The given parameter governs evaluation on the test set: for each test user, the algorithm is provided the number of ratings specified by given (10 here), while the rest of that user’s ratings are held out and compared to the model’s predictions. The differences between the predicted and actual values are used to calculate error metrics.

set.seed(123)
eval_scheme <- evaluationScheme(ratings_matrix_centered, method = 'split',
                                train = 0.8, given = 10)
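
As a brief sanity check on the scheme (a sketch, not required for the modeling below): the ‘train’ portion should contain roughly 80% of the users, each test user should have exactly 10 ratings in the ‘known’ portion, and the remaining test ratings sit in the ‘unknown’ portion used for error calculation.

# Ratings per test user in the 'known' set (should all equal given = 10)
table(rowCounts(getData(eval_scheme, 'known')))

# Dimensions of the train and unknown matrices
dim(getData(eval_scheme, 'train'))
dim(getData(eval_scheme, 'unknown'))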

UBCF

Below I train two user-based collaborative filtering models: one with the default neighborhood of 25 and one with a neighborhood of 50. The warnings below simply note that the matrix was already centered, so recommenderlab does not normalize it a second time.

ubcf25 <- Recommender(getData(eval_scheme, 'train'),'UBCF')
Warning in .local(x, ...): x was already normalized by row!
ubcf50 <- Recommender(getData(eval_scheme, 'train'),'UBCF',
                      parameter = list(nn=50))
Warning in .local(x, ...): x was already normalized by row!

Next, I compute the predicted ratings for the test data

ubcf25_pred <- predict(ubcf25, getData(eval_scheme, 'known'), type = 'ratings')

ubcf50_pred <- predict(ubcf50, getData(eval_scheme, 'known'), type = 'ratings')

Below I calculate the error:

error_table <- rbind(
  UBCF25 = calcPredictionAccuracy(ubcf25_pred, getData(eval_scheme,'unknown')),
  UBCF50 =  calcPredictionAccuracy(ubcf50_pred, getData(eval_scheme,'unknown'))
)

IBCF

Below I train two item-based collaborative filtering models: one with the default neighborhood of 30 and one with a neighborhood of 50.

ibcf30 <- Recommender(getData(eval_scheme, 'train'),'IBCF')
Warning in .local(x, ...): x was already normalized by row!
ibcf50 <- Recommender(getData(eval_scheme, 'train'),'IBCF',
                      parameter =list(k=50))
Warning in .local(x, ...): x was already normalized by row!

Next, I compute the predicted ratings for the test data

ibcf30_pred <- predict(ibcf30, getData(eval_scheme, 'known'), type = 'ratings')

ibcf50_pred <- predict(ibcf50, getData(eval_scheme, 'known'), type = 'ratings')

Below I calculate the error:

error_table <- rbind(error_table,
  IBCF30 = calcPredictionAccuracy(ibcf30_pred, getData(eval_scheme,'unknown')),
  IBCF50 =  calcPredictionAccuracy(ibcf50_pred, getData(eval_scheme,'unknown'))
)

Correlation

Below I train an item-based and a user-based collaborative filtering model where similarity is measured via Pearson correlation.

ubcfPearson <- Recommender(getData(eval_scheme, 'train'),'UBCF',
                           parameter = list(method = 'pearson'))
Warning in .local(x, ...): x was already normalized by row!
ibcfPearson <- Recommender(getData(eval_scheme, 'train'),'IBCF',
                      parameter = list(method = 'pearson'))
Warning in .local(x, ...): x was already normalized by row!

Next, I compute the predicted ratings for the test data

ubcfPearson_pred <- predict(ubcfPearson, getData(eval_scheme, 'known'), type = 'ratings')

ibcfPearson_pred <- predict(ibcfPearson, getData(eval_scheme, 'known'), type = 'ratings')

Below I calculate the error:

error_table <- rbind(error_table,
  UBCFPearson = calcPredictionAccuracy(ubcfPearson_pred, getData(eval_scheme,'unknown')),
  IBCFPearson =  calcPredictionAccuracy(ibcfPearson_pred, getData(eval_scheme,'unknown'))
)

Random

Last, I repeat the above steps for a random model, which will be used for baseline comparison purposes

set.seed(123)
random <- Recommender(getData(eval_scheme, 'train'),'Random')

random_pred <- predict(random, getData(eval_scheme, 'known'), type = 'ratings')


error_table <- rbind(error_table,
  Random = calcPredictionAccuracy(random_pred, getData(eval_scheme,'unknown'))
)

Summary

error_table <- error_table |> as.data.frame() |> rownames_to_column('Model')
error_table_long <- error_table |> pivot_longer(!Model, names_to = 'Metric',
                                                values_to = 'Value')
error_table_long$Model <- factor(error_table_long$Model,
                                 levels = c('IBCF30','IBCF50',
                                            'UBCF25','UBCF50',
                                            'UBCFPearson','IBCFPearson',
                                            'Random'))
error_table_long |> ggplot(aes(x=Metric, y = Value, fill = Model)) +
  geom_bar(position = 'dodge', stat = 'identity') +
  ggtitle('Error Metrics by Model') + theme_minimal()

error_table
        Model     RMSE      MSE       MAE
1      UBCF25 1.127553 1.271376 0.8637211
2      UBCF50 1.092479 1.193510 0.8384683
3      IBCF30 1.131471 1.280228 0.8270833
4      IBCF50 1.118631 1.251335 0.8283588
5 UBCFPearson 1.060218 1.124062 0.8168008
6 IBCFPearson 1.136539 1.291721 0.8275854
7      Random 2.234631 4.993576 1.8625028

As seen above, both the user-based and item-based collaborative filtering models performed better than the random model. The best performing model was user-based collaborative filtering with similarity measured via Pearson correlation. Furthermore, using larger neighborhoods resulted in slightly better performance. That said, performance among all of the UBCF and IBCF models was fairly similar. It is possible that changing neighborhood sizes more substantially (e.g. to 100 neighbors) would have produced more differentiated performance, possibly worse, since similarity within the neighborhood becomes diluted as it grows.
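
As a sketch of how that could be checked (nn = 100 is purely illustrative and was not evaluated above):

# Hypothetical wider neighborhood: UBCF with 100 nearest neighbors
ubcf100 <- Recommender(getData(eval_scheme, 'train'), 'UBCF',
                       parameter = list(nn = 100))
ubcf100_pred <- predict(ubcf100, getData(eval_scheme, 'known'), type = 'ratings')
calcPredictionAccuracy(ubcf100_pred, getData(eval_scheme, 'unknown'))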

One challenge for collaborative filtering algorithms is the cold-start problem: producing recommendations for new users or new items that have few ratings. The user with the fewest ratings in this data set had 20 ratings, and many users had far more; however, some movies had as few as one rating, which makes it difficult to get an accurate sense of similarity for those movies. This could be one reason why the user-based algorithms performed slightly better. Additionally, incorporating content-based filtering could allow for a more comprehensive assessment of movie similarity, since movies could be compared on features such as genres or descriptions rather than ratings.

Number of ratings per user:

ratings |> group_by(userId) |> summarise(num_ratings_per_user = n()) |> select(num_ratings_per_user) |> psych::describe()
                     vars   n  mean     sd median trimmed   mad min  max range
num_ratings_per_user    1 610 165.3 269.48   70.5  105.32 64.49  20 2698  2678
                     skew kurtosis    se
num_ratings_per_user 4.61    29.72 10.91

Number of ratings per movie:

ratings |> group_by(movieId) |> summarise(num_ratings_per_movie = n()) |> select(num_ratings_per_movie) |> psych::describe()
                      vars    n  mean   sd median trimmed  mad min max range
num_ratings_per_movie    1 9724 10.37 22.4      3    5.06 2.97   1 329   328
                      skew kurtosis   se
num_ratings_per_movie 5.24    38.94 0.23

Future research could include tuning model parameters such as neighborhood size and evaluating other similarity metrics (e.g. Jaccard). Additionally, the item-feature matrix built above was not used here, but it would be needed to implement a content-based algorithm; further work could compare the models above to a content-based approach.
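
As a starting point for such a comparison, below is a hedged sketch of how the item-feature matrix could drive a content-based notion of similarity: Jaccard similarity between the genre vectors of the first few movies. Because the rows of item_feature_matrix are movies, which = 'users' compares movies here.

# Genre-based (content) similarity between the first five movies
similarity(item_feature_matrix[1:5, ], method = 'jaccard', which = 'users')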