DATA612 Project 2
Project 2
Project 2 requires implementing two of content-based filtering, user-user collaborative filtering, or item-item collaborative filtering models on a data set of choice. I will work with the MovieLens small data set, saved in my GitHub repository.
Below I load the necessary libraries and import the data set:
library(tidyverse)
library(recommenderlab)
library(superml)

movies <- read_csv('https://raw.githubusercontent.com/stoybis/DATA612/refs/heads/main/Project2/movies.csv')
ratings <- read_csv('https://raw.githubusercontent.com/stoybis/DATA612/refs/heads/main/Project2/ratings.csv')
Exploratory Analysis
head(movies)
# A tibble: 6 × 3
movieId title genres
<dbl> <chr> <chr>
1 1 Toy Story (1995) Adventure|Animation|Children|Comed…
2 2 Jumanji (1995) Adventure|Children|Fantasy
3 3 Grumpier Old Men (1995) Comedy|Romance
4 4 Waiting to Exhale (1995) Comedy|Drama|Romance
5 5 Father of the Bride Part II (1995) Comedy
6 6 Heat (1995) Action|Crime|Thriller
head(ratings)
# A tibble: 6 × 4
userId movieId rating timestamp
<dbl> <dbl> <dbl> <dbl>
1 1 1 4 964982703
2 1 3 4 964981247
3 1 6 4 964982224
4 1 47 5 964983815
5 1 50 5 964982931
6 1 70 3 964982400
length(unique(movies$movieId))
[1] 9742
length(unique(ratings$userId))
[1] 610
length(unique(ratings$movieId))
[1] 9724
There are more movies in the movies file than movies that were actually rated, so I subset the movie data to rated movies only:
movies <- movies |> filter(movieId %in% ratings$movieId)
Below is the distribution of ratings:
ratings |> ggplot(aes(x = '', y = rating)) + geom_boxplot() +
  stat_summary(fun = mean, geom = 'point', shape = 4, size = 5, color = 'red') +
  ggtitle('Distribution of Movie Ratings') + theme_minimal()
Below is the distribution of the number of movies that each user has rated:
ratings |> group_by(userId) |> summarise(num_ratings = n_distinct(movieId)) |>
  ggplot(aes(x = num_ratings, y = '')) + geom_boxplot() +
  stat_summary(fun = mean, geom = 'point', shape = 4, size = 5, color = 'red') +
  ggtitle('Distribution of Ratings per User') + theme_minimal()
The average movie rating is 3.5. The average user has rated 165 movies, but there is a long tail of users with a much larger number of ratings:
ratings |> group_by(userId) |> summarise(num_ratings = n()) |> select(num_ratings) |> summary()
num_ratings
Min. : 20.0
1st Qu.: 35.0
Median : 70.5
Mean : 165.3
3rd Qu.: 168.0
Max. :2698.0
Below is the distribution of the number of ratings per movie:
ratings |> group_by(movieId) |> summarise(num_ratings = n()) |>
  ggplot(aes(x = num_ratings, y = '')) + geom_boxplot() +
  stat_summary(fun = mean, geom = 'point', shape = 4, size = 5, color = 'red') +
  ggtitle('Distribution of Ratings per Movie') + theme_minimal()

ratings |> group_by(movieId) |> summarise(num_ratings = n()) |> select(num_ratings) |> summary()
num_ratings
Min. : 1.00
1st Qu.: 1.00
Median : 3.00
Mean : 10.37
3rd Qu.: 9.00
Max. :329.00
Data Prep
First, I transform the ratings data into a matrix that can be used with recommenderlab functions:
ratings_matrix <- ratings |> select(-timestamp) |>
  pivot_wider(names_from = movieId, values_from = rating)
ratings_matrix <- as.matrix(ratings_matrix[, -1])
ratings_matrix <- as(ratings_matrix, 'realRatingMatrix')
ratings_matrix
610 x 9724 rating matrix of class 'realRatingMatrix' with 100836 ratings.
I now have a ratings matrix in which each row is a user, each column is a movie, and each user's rating of a movie is stored in the corresponding cell. This is a sparse matrix (most values are NA), and only non-NA values are stored explicitly for efficient handling. Below is a snippet of the matrix:
getRatingMatrix(ratings_matrix[1:7,1:7])
7 x 7 sparse Matrix of class "dgCMatrix"
1 3 6 47 50 70 101
[1,] 4.0 4 4 5 5.0 3 5
[2,] . . . . . . .
[3,] . . . . . . .
[4,] . . . 2 . . .
[5,] 4.0 . . . 4.0 . .
[6,] . 5 4 4 1.0 . .
[7,] 4.5 . . . 4.5 . .
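To quantify that sparsity, the share of user-movie cells that actually contain a rating can be computed; a minimal sketch using recommenderlab's rowCounts() accessor:
# fraction of cells with a rating: 100836 ratings out of 610 x 9724 cells
total_cells <- nrow(ratings_matrix) * ncol(ratings_matrix)
sum(rowCounts(ratings_matrix)) / total_cells  # ~0.017, i.e. roughly 1.7% of cells are rated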
Next, I create an item matrix that describes each item (movie) based on its features (genres). Each feature takes the value 1 if the genre applies to the movie and 0 otherwise. I use the CountVectorizer function from the superml package.
cfv <- CountVectorizer$new()
item_feature_matrix <- cfv$fit_transform(movies$genres)
item_feature_matrix <- as(item_feature_matrix, 'binaryRatingMatrix')
item_feature_matrix
9724 x 23 rating matrix of class 'binaryRatingMatrix' with 23145 ratings.
head(as(item_feature_matrix, 'matrix'))
drama comedy thriller action romance adventure crime sci fi horror
1 FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
2 FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
3 FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
4 TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
5 FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
6 FALSE FALSE TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE
fantasy children animation mystery documentary war musical western imax
1 TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
2 TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
3 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
4 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
5 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
6 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
noir film listed genres
1 FALSE FALSE FALSE FALSE
2 FALSE FALSE FALSE FALSE
3 FALSE FALSE FALSE FALSE
4 FALSE FALSE FALSE FALSE
5 FALSE FALSE FALSE FALSE
6 FALSE FALSE FALSE FALSE
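For intuition, essentially the same matrix could be built directly in base R by splitting the genre strings on '|'; a minimal sketch (note the tokenization differs slightly from CountVectorizer, which splits multi-word genres such as 'Sci-Fi' into separate tokens):
genres_split  <- strsplit(movies$genres, '\\|')      # one character vector of genres per movie
all_genres    <- sort(unique(unlist(genres_split)))  # vocabulary of distinct genres
item_features <- t(sapply(genres_split, function(g) as.integer(all_genres %in% g)))
colnames(item_features) <- all_genres                # rows = movies, columns = genre indicators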
Modeling
I will compare item-based and user-based collaborative filtering, examining which algorithm results in the lowest error when comparing actual ratings to predicted ratings. Both models will be compared to the "RANDOM" model in the recommenderlab package, which generates random recommendations. This baseline provides a point of comparison for determining whether the models are better than guessing.
User-based collaborative filtering groups users into neighborhoods based on similarity of tastes. In other words, users who rate items similarly have similar tastes, so a user can be recommended an item that other users in the neighborhood enjoyed. Similarity can be defined via different metrics, such as cosine similarity (the default), Pearson correlation, or Jaccard similarity. The metric is calculated on the row vectors of the ratings matrix, using the items that both users in the comparison have rated. A neighborhood of similar users is determined either by taking the k-nearest neighbors in the vector space or by applying a user-specified similarity threshold. This neighborhood is then used to recommend to a user items that he has not interacted with but that other users in the neighborhood enjoyed. Additionally, the neighborhood's rating for an item (e.g. the average rating the neighbors gave it) can be used to predict the rating that a user in the neighborhood would give to an item he has not interacted with.
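To make this concrete, below is a toy sketch (with made-up ratings, not the MovieLens data) of the two core steps: cosine similarity computed on co-rated items, and a similarity-weighted average used as the predicted rating:
# Toy example: three users rating four items; user 1 has not rated item 4
u1 <- c(4, 5, 1, NA)
u2 <- c(5, 5, 2, 4)
u3 <- c(1, 2, 5, 2)
cos_sim <- function(a, b) {
  ok <- !is.na(a) & !is.na(b)  # restrict to items rated by both users
  sum(a[ok] * b[ok]) / (sqrt(sum(a[ok]^2)) * sqrt(sum(b[ok]^2)))
}
sims <- c(cos_sim(u1, u2), cos_sim(u1, u3))  # u2 is far more similar to u1 than u3 is
# predicted rating of item 4 for user 1: similarity-weighted average of neighbors' ratings
sum(sims * c(u2[4], u3[4])) / sum(sims)      # ~3.3, pulled toward the more similar user's rating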
Item-based collaborative filtering evaluates similarity between items and recommends items that are similar to items the user previously enjoyed. This is done by creating an item-to-item similarity matrix using one of the aforementioned similarity measures. Instead of comparing rows of users (as in user-based collaborative filtering), columns of items are evaluated for similarity. The predicted rating for an item the user has not rated is the average of the user's ratings of similar items, weighted by those items' similarities to the target item.
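As an illustration (separate from the model fitting below), the item-item similarities underlying this approach can be inspected with recommenderlab's similarity() helper, where which = 'items' compares the columns of the ratings matrix; a minimal sketch:
# cosine similarity between the first five movies, computed column-wise
similarity(ratings_matrix[, 1:5], method = 'cosine', which = 'items')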
In addition to comparing these algorithms, I will also compare performance when altering the number of neighbors the algorithms use to form groups of similar users/items. The default is 25 neighbors for UBCF and 30 for IBCF; I will compare these to a value of 50 (a broader neighborhood). Furthermore, I will also train these models with similarity measured by Pearson correlation to see how it compares to the default similarity measure of cosine similarity.
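The default parameter values cited above can be checked against recommenderlab's method registry; a quick sketch, assuming the registry entries for the realRatingMatrix data type:
recommenderRegistry$get_entry('UBCF', dataType = 'realRatingMatrix')$parameters  # nn = 25 by default
recommenderRegistry$get_entry('IBCF', dataType = 'realRatingMatrix')$parameters  # k = 30 by default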
Evaluation Scheme
First, I normalize the ratings matrix. This removes user bias by subtracting each row's (user's) average rating from every observation in that row.
ratings_matrix_centered <- normalize(ratings_matrix, 'center')
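A quick sanity-check sketch of what centering does: manually subtracting the first user's mean rating from his first rating should reproduce the corresponding value in the normalized matrix:
raw_user1  <- as(ratings_matrix[1, ], 'matrix')   # user 1's raw ratings (NA for unrated movies)
user1_mean <- mean(raw_user1, na.rm = TRUE)
raw_user1[1, 1] - user1_mean                      # manual centering of one rating
as(ratings_matrix_centered[1, ], 'matrix')[1, 1]  # same value via normalize()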
Next, I create an evaluation scheme. This is an object in recommenderlab that stores instructions on how to split the data into training and test sets and evaluate results. Below I split the normalized ratings matrix via an 80/20 split. The given parameter controls evaluation on the test set: for each test user, the algorithm is provided the number of that user's ratings specified by given, while the rest of the user's ratings (the user's total test-set ratings minus given) are held out for comparison with model predictions. The differences between predicted and actual values are used to calculate error metrics.
set.seed(123)
eval_scheme <- evaluationScheme(ratings_matrix_centered, method = 'split',
                                train = 0.8, given = 10)
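A small sketch to verify the split behaves as described: every test user's 'known' set should contain exactly given = 10 ratings, with the remainder held out in 'unknown':
table(rowCounts(getData(eval_scheme, 'known')))      # all test users should show 10
summary(rowCounts(getData(eval_scheme, 'unknown')))  # the held-out remainder per user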
UBCF
Below I train two user-based collaborative filtering models: one with the default neighborhood size of 25 and one with a neighborhood size of 50.
ubcf25 <- Recommender(getData(eval_scheme, 'train'), 'UBCF')

Warning in .local(x, ...): x was already normalized by row!

ubcf50 <- Recommender(getData(eval_scheme, 'train'), 'UBCF',
                      parameter = list(nn = 50))

Warning in .local(x, ...): x was already normalized by row!
Next, I compute the predicted ratings for the test data:
ubcf25_pred <- predict(ubcf25, getData(eval_scheme, 'known'), type = 'ratings')
ubcf50_pred <- predict(ubcf50, getData(eval_scheme, 'known'), type = 'ratings')
Below I calculate the error:
error_table <- rbind(
  UBCF25 = calcPredictionAccuracy(ubcf25_pred, getData(eval_scheme, 'unknown')),
  UBCF50 = calcPredictionAccuracy(ubcf50_pred, getData(eval_scheme, 'unknown'))
)
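For reference, calcPredictionAccuracy() reports RMSE, MSE, and MAE over the held-out ratings. A sketch of computing the same quantities manually, assuming the dense-matrix coercion (cells with no prediction or no held-out rating are NA):
pred_vals   <- as(ubcf25_pred, 'matrix')                      # predicted ratings
actual_vals <- as(getData(eval_scheme, 'unknown'), 'matrix')  # held-out actual ratings
err <- pred_vals - actual_vals                                # defined only where both sides exist
sqrt(mean(err^2, na.rm = TRUE))                               # RMSE
mean(abs(err), na.rm = TRUE)                                  # MAE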
IBCF
Below I train two item-based collaborative filtering models: one with the default neighborhood size of 30 and one with a neighborhood size of 50.
ibcf30 <- Recommender(getData(eval_scheme, 'train'), 'IBCF')

Warning in .local(x, ...): x was already normalized by row!

ibcf50 <- Recommender(getData(eval_scheme, 'train'), 'IBCF',
                      parameter = list(k = 50))

Warning in .local(x, ...): x was already normalized by row!
Next, I compute the predicted ratings for the test data:
ibcf30_pred <- predict(ibcf30, getData(eval_scheme, 'known'), type = 'ratings')
ibcf50_pred <- predict(ibcf50, getData(eval_scheme, 'known'), type = 'ratings')
Below I calculate the error:
error_table <- rbind(error_table,
  IBCF30 = calcPredictionAccuracy(ibcf30_pred, getData(eval_scheme, 'unknown')),
  IBCF50 = calcPredictionAccuracy(ibcf50_pred, getData(eval_scheme, 'unknown'))
)
Correlation
Below I train an item-based and a user-based collaborative filtering model where similarity is measured via Pearson correlation:
ubcfPearson <- Recommender(getData(eval_scheme, 'train'), 'UBCF',
                           parameter = list(method = 'pearson'))

Warning in .local(x, ...): x was already normalized by row!

ibcfPearson <- Recommender(getData(eval_scheme, 'train'), 'IBCF',
                           parameter = list(method = 'pearson'))

Warning in .local(x, ...): x was already normalized by row!
Next, I compute the predicted ratings for the test data:
ubcfPearson_pred <- predict(ubcfPearson, getData(eval_scheme, 'known'), type = 'ratings')
ibcfPearson_pred <- predict(ibcfPearson, getData(eval_scheme, 'known'), type = 'ratings')
Below I calculate the error:
error_table <- rbind(error_table,
  UBCFPearson = calcPredictionAccuracy(ubcfPearson_pred, getData(eval_scheme, 'unknown')),
  IBCFPearson = calcPredictionAccuracy(ibcfPearson_pred, getData(eval_scheme, 'unknown'))
)
Random
Last, I repeat the above steps for a random model, which is used for baseline comparison:
set.seed(123)
random <- Recommender(getData(eval_scheme, 'train'), 'RANDOM')

random_pred <- predict(random, getData(eval_scheme, 'known'), type = 'ratings')

error_table <- rbind(error_table,
  Random = calcPredictionAccuracy(random_pred, getData(eval_scheme, 'unknown'))
)
Summary
error_table <- error_table |> as.data.frame() |> rownames_to_column('Model')

error_table_long <- error_table |> pivot_longer(!Model, names_to = 'Metric',
                                                values_to = 'Value')

error_table_long$Model <- factor(error_table_long$Model,
                                 levels = c('IBCF30', 'IBCF50',
                                            'UBCF25', 'UBCF50',
                                            'UBCFPearson', 'IBCFPearson',
                                            'Random'))

error_table_long |> ggplot(aes(x = Metric, y = Value, fill = Model)) +
  geom_bar(position = 'dodge', stat = 'identity') +
  ggtitle('Error Metrics by Model') + theme_minimal()
error_table
Model RMSE MSE MAE
1 UBCF25 1.127553 1.271376 0.8637211
2 UBCF50 1.092479 1.193510 0.8384683
3 IBCF30 1.131471 1.280228 0.8270833
4 IBCF50 1.118631 1.251335 0.8283588
5 UBCFPearson 1.060218 1.124062 0.8168008
6 IBCFPearson 1.136539 1.291721 0.8275854
7 Random 2.234631 4.993576 1.8625028
As seen above, both user-based and item-based collaborative filtering performed better than the random model. The best-performing model was user-based collaborative filtering with similarity measured via Pearson correlation. Furthermore, using larger neighborhoods resulted in slightly better performance. That said, performance among all of the UBCF and IBCF models was fairly similar. It is possible that changing neighborhood sizes more substantially (e.g. 100 neighbors) would have produced more differentiated performance (possibly worse, since similarity within a neighborhood becomes diluted as it grows).
One challenge for collaborative filtering algorithms is the cold-start problem: making recommendations for new users or new items that have few ratings. The user with the fewest ratings in this data set had 20 ratings, and many users had far more. However, some movies had as few as one rating, which could make it difficult to get an accurate sense of similarity for movies with few reviews. This could be one of the reasons why the user-based algorithms performed slightly better. Additionally, incorporating content-based filtering could allow for a more comprehensive assessment of movie similarity, as movies could be compared on features such as genres or descriptions rather than ratings.
Number of ratings per user:
ratings |> group_by(userId) |> summarise(num_ratings_per_user = n()) |> select(num_ratings_per_user) |> psych::describe()
vars n mean sd median trimmed mad min max range
num_ratings_per_user 1 610 165.3 269.48 70.5 105.32 64.49 20 2698 2678
skew kurtosis se
num_ratings_per_user 4.61 29.72 10.91
Number of ratings per movie:
ratings |> group_by(movieId) |> summarise(num_ratings_per_movie = n()) |> select(num_ratings_per_movie) |> psych::describe()
vars n mean sd median trimmed mad min max range
num_ratings_per_movie 1 9724 10.37 22.4 3 5.06 2.97 1 329 328
skew kurtosis se
num_ratings_per_movie 5.24 38.94 0.23
Future research would include tuning model parameters such as neighborhood size and evaluating additional similarity metrics (e.g. Jaccard). Additionally, the item-feature matrix built above was not utilized, but it would be necessary for implementing a content-based algorithm; further research could compare the above algorithms to a content-based approach, sketched below.
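As a starting point for that comparison, movie-to-movie similarity could be computed from the genre matrix alone; a minimal sketch using recommenderlab's similarity() on the binary item-feature matrix (rows of this matrix are movies, so the default which = 'users' dimension corresponds to movies here):
# Jaccard similarity between the first five movies based only on shared genres
similarity(item_feature_matrix[1:5, ], method = 'jaccard')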