Assignment 11 Code Base

Author

Long Lin

Overview

For this assignment, I built a personalized recommendation system using user-to-user collaborative filtering on my data set from assignment 2A. I used the recommenderlab library because it provides a User-Based Collaborative Filtering (UBCF) model. The system predicts, for each user, one movie they have not yet rated and are likely to enjoy. I then evaluated the system with a cross-validation scheme and used the results to draw my final conclusions.

Reading in the data

To start off, I took the data from my previous assignment and entered it as a tibble using tribble().

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

movies_df <- tribble(
  ~User,     ~Movie,                             ~Rating,
  "David",   "Oppenheimer",                      5,
  "David",   "Wicked",                           3,
  "David",   "Top Gun: Maverick",                4,
  "David",   "Zootopia 2",                       4,
  "Aaron",   "Oppenheimer",                      4,
  "Aaron",   "Top Gun: Maverick",                5,
  "Aaron",   "Zootopia 2",                       4,
  "Aaron",   "The Housemaid",                    4,
  "Josh",    "Oppenheimer",                      4,
  "Josh",    "Wicked",                           4,
  "Josh",    "Top Gun: Maverick",                3,
  "Josh",    "Zootopia 2",                       3,
  "Cameron", "Wicked",                           3,
  "Cameron", "Zootopia 2",                       5,
  "Cameron", "Captain America: Brave New World", 4,
  "June",    "Top Gun: Maverick",                3,
  "June",    "The Housemaid",                    5,
  "June",    "Captain America: Brave New World", 3
)

print(movies_df)

# A tibble: 18 × 3
   User    Movie                            Rating
   <chr>   <chr>                             <dbl>
 1 David   Oppenheimer                           5
 2 David   Wicked                                3
 3 David   Top Gun: Maverick                     4
 4 David   Zootopia 2                            4
 5 Aaron   Oppenheimer                           4
 6 Aaron   Top Gun: Maverick                     5
 7 Aaron   Zootopia 2                            4
 8 Aaron   The Housemaid                         4
 9 Josh    Oppenheimer                           4
10 Josh    Wicked                                4
11 Josh    Top Gun: Maverick                     3
12 Josh    Zootopia 2                            3
13 Cameron Wicked                                3
14 Cameron Zootopia 2                            5
15 Cameron Captain America: Brave New World      4
16 June    Top Gun: Maverick                     3
17 June    The Housemaid                         5
18 June    Captain America: Brave New World      3

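Next, I converted movies_df to a wide format using pivot_wider, so that each row is a user and each column is a movie, and moved the User column into the row names with column_to_rownames. Movies a user has not rated appear as NA.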

movies_wide <- movies_df %>%
  pivot_wider(names_from = Movie, values_from = Rating) %>%
  column_to_rownames("User")

print(movies_wide)

        Oppenheimer Wicked Top Gun: Maverick Zootopia 2 The Housemaid
David             5      3                 4          4            NA
Aaron             4     NA                 5          4             4
Josh              4      4                 3          3            NA
Cameron          NA      3                NA          5            NA
June             NA     NA                 3         NA             5
        Captain America: Brave New World
David                                 NA
Aaron                                 NA
Josh                                  NA
Cameron                                4
June                                   3

Using recommenderlab

From there, I decided to use the existing recommenderlab package because it implements the user-based collaborative filtering (UBCF) algorithm that I was interested in.

To use recommenderlab, I first had to convert my wide data into the package's realRatingMatrix class.

library(recommenderlab)

Loading required package: Matrix


Attaching package: 'Matrix'

The following objects are masked from 'package:tidyr':

    expand, pack, unpack

Loading required package: arules


Attaching package: 'arules'

The following object is masked from 'package:dplyr':

    recode

The following objects are masked from 'package:base':

    abbreviate, write

Loading required package: proxy


Attaching package: 'proxy'

The following object is masked from 'package:Matrix':

    as.matrix

The following objects are masked from 'package:stats':

    as.dist, dist

The following object is masked from 'package:base':

    as.matrix

rating_matrix <- as(as.matrix(movies_wide), "realRatingMatrix")

print(rating_matrix)

5 x 6 rating matrix of class 'realRatingMatrix' with 18 ratings.
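
As a quick check on how sparse this matrix is, the sketch below rebuilds the wide table shown earlier as a plain numeric matrix and counts the filled cells. It uses base R only; the names ratings, n_ratings, density, and ratings_per_user are introduced here for illustration and are not part of the original analysis.

```r
# Rebuild the 5 x 6 wide table as a plain numeric matrix (NA = not rated).
ratings <- matrix(
  c( 5,  3,  4,  4, NA, NA,   # David
     4, NA,  5,  4,  4, NA,   # Aaron
     4,  4,  3,  3, NA, NA,   # Josh
    NA,  3, NA,  5, NA,  4,   # Cameron
    NA, NA,  3, NA,  5,  3),  # June
  nrow = 5, byrow = TRUE,
  dimnames = list(
    c("David", "Aaron", "Josh", "Cameron", "June"),
    c("Oppenheimer", "Wicked", "Top Gun: Maverick", "Zootopia 2",
      "The Housemaid", "Captain America: Brave New World")
  )
)

n_ratings        <- sum(!is.na(ratings))         # number of observed ratings
density          <- n_ratings / length(ratings)  # share of cells that are filled
ratings_per_user <- rowSums(!is.na(ratings))     # how many movies each user rated

print(n_ratings)
print(density)
print(ratings_per_user)
```

The count matches the 18 ratings reported for the realRatingMatrix, spread over 30 possible user-movie cells.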

Next, I trained a UBCF (User-Based Collaborative Filtering) recommender model with recommenderlab on the ratings from all 5 users. For each user, such as David or Aaron, UBCF looks for other users with similar ratings and bases its recommendations on what those neighbors rated highly.

recommender_model <- Recommender(data = rating_matrix, method = "UBCF")

print(recommender_model)

Recommender of type 'UBCF' for 'realRatingMatrix' 
learned using 5 users.
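
To make the idea of "similar users" concrete, here is a minimal base R sketch of cosine similarity computed only over the movies two users have both rated. It is an illustration under stated assumptions, not recommenderlab's internal implementation: the helper cosine_on_shared is hypothetical, and it skips any normalization the package may apply before comparing users.

```r
# Rating vectors in the column order of the wide table (NA = not rated).
david <- c(5, 3, 4, 4, NA, NA)
josh  <- c(4, 4, 3, 3, NA, NA)
june  <- c(NA, NA, 3, NA, 5, 3)

# Hypothetical helper: cosine similarity restricted to co-rated movies.
cosine_on_shared <- function(a, b) {
  shared <- !is.na(a) & !is.na(b)
  if (!any(shared)) return(NA_real_)  # no overlap, similarity undefined
  sum(a[shared] * b[shared]) /
    (sqrt(sum(a[shared]^2)) * sqrt(sum(b[shared]^2)))
}

print(cosine_on_shared(david, josh))  # four shared movies
print(cosine_on_shared(david, june))  # one shared movie (Top Gun: Maverick)
```

David and June share a single movie, so this raw cosine similarity is exactly 1 no matter what ratings they gave it. With so little overlap, the similarity scores carry very little information about actual taste.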

Next, I used recommender_model to predict, for each user, the single unrated movie (n = 1) that they would most likely enjoy.

predictions <- predict(recommender_model, rating_matrix, n = 1)

prediction_list <- as(predictions, "list")
names(prediction_list) <- rownames(rating_matrix)

print(prediction_list)

$David
[1] "The Housemaid"

$Aaron
[1] "Wicked"

$Josh
[1] "The Housemaid"

$Cameron
[1] "The Housemaid"

$June
[1] "Oppenheimer"

These are the results of the personalized recommendation system using the UBCF model. The system produces a recommendation for every user, but I believe my data set is too small for it to work effectively: it recommends the same movie, The Housemaid, to three of the five users.
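
The lack of variety in the recommendations can be tallied directly. The sketch below re-enters the printed predictions by hand so that it runs on its own; recommendation_counts is a name introduced here for illustration.

```r
# Predictions copied from the printed output above.
prediction_list <- list(
  David   = "The Housemaid",
  Aaron   = "Wicked",
  Josh    = "The Housemaid",
  Cameron = "The Housemaid",
  June    = "Oppenheimer"
)

# Count how many users received each movie, most frequent first.
recommendation_counts <- sort(table(unlist(prediction_list)), decreasing = TRUE)
print(recommendation_counts)
```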

Evaluating the Recommender System

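To evaluate the recommender system, I set up a 4-fold cross-validation scheme because my dataset is small, with only 5 users and 18 ratings. A k-fold scheme splits the users into k folds of roughly equal size; in each of the k runs, one fold is held out for testing and the model is trained on the remaining folds, so here there are 4 runs that each train on 3 folds. The given = 1 setting means that only one rating per test user is shown to the model while the rest are withheld for scoring, and goodRating = 4 treats ratings of 4 or higher as good.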

eval_scheme <- evaluationScheme(data = rating_matrix, 
                                method = "cross-validation", 
                                k = 4, 
                                given = 1, 
                                goodRating = 4)

print(eval_scheme)

Evaluation scheme with 1 items given
Method: 'cross-validation' with 4 run(s).
Good ratings: >=4.000000
Data set: 5 x 6 rating matrix of class 'realRatingMatrix' with 18 ratings.
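
The train/test rotation behind k-fold cross-validation can be sketched in base R as follows. This is only an illustration of the general technique: the seed is arbitrary, the names fold_id and folds are hypothetical, and recommenderlab's own assignment of users to folds may differ (for example, in how it handles a user count that does not divide evenly by k).

```r
set.seed(42)  # assumption: arbitrary seed so the sketch is reproducible

users <- c("David", "Aaron", "Josh", "Cameron", "June")
k <- 4

# Assign each user to one of k folds, then shuffle the assignment.
fold_id <- sample(rep(seq_len(k), length.out = length(users)))
folds   <- split(users, fold_id)

# In each run, one fold is the test set and the other folds form the training set.
for (i in seq_len(k)) {
  test_users  <- folds[[i]]
  train_users <- setdiff(users, test_users)
  cat(sprintf("Run %d: train on %d users, test on %d\n",
              i, length(train_users), length(test_users)))
}
```

With only 5 users, every test fold in this sketch holds one or two users, so each run's error estimate rests on very few withheld ratings.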

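Next, I trained a UBCF model with a single nearest neighbor (nn = 1) on the training data from eval_scheme, predicted the withheld ratings, and calculated the prediction error. Note that getData returns the data for the first run unless its run argument is set, so this code scores one fold rather than all 4.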

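# Train UBCF on the training users, using only the single nearest neighbor
eval_recommender <- Recommender(getData(eval_scheme, "train"), 
                                method = "UBCF", 
                                parameter = list(nn = 1))

# Predict ratings for the test users from their one "known" rating each
eval_prediction <- predict(eval_recommender, getData(eval_scheme, "known"), type = "ratings")

# Compare the predictions against the withheld ("unknown") ratings
eval_accuracy <- calcPredictionAccuracy(eval_prediction, getData(eval_scheme, "unknown"))
print(eval_accuracy)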

RMSE  MSE  MAE 
 NaN  NaN  NaN

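The results were all NaN, most likely because the dataset is too small. With given = 1, each test user exposes only one rating to the model, so there is too little overlap with the training users to find a meaningful neighbor and predict the withheld ratings. If no withheld rating receives a prediction, there are no errors to average, and the metrics come out as NaN.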
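
The arithmetic behind a NaN error metric can be reproduced in base R. The sketch below assumes that missing predictions are dropped before averaging, which is an assumption about how the metric is computed rather than a confirmed detail of calcPredictionAccuracy. The rmse helper and the example ratings are hypothetical.

```r
# Hypothetical helper: RMSE that ignores pairs with a missing prediction.
rmse <- function(predicted, actual) {
  sqrt(mean((predicted - actual)^2, na.rm = TRUE))
}

actual           <- c(4, 5, 3)        # hypothetical withheld ratings
no_predictions   <- c(NA_real_, NA_real_, NA_real_)
some_predictions <- c(4, 4, NA)

print(rmse(no_predictions, actual))    # NaN: the mean of zero values is undefined
print(rmse(some_predictions, actual))  # finite: two of three ratings were predicted
```

A single usable prediction is enough to produce a finite value, so an all-NaN result suggests that no withheld rating in the evaluated fold received a prediction.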

Conclusion

The personalized recommender system using a User-Based Collaborative Filtering model was able to generate a recommendation for each user. However, the evaluation returned NaN for every error metric, which suggests that the system is not robust because it lacks data. To improve it, I believe more data is a must: UBCF relies on overlap between users' ratings, and 5 users with 18 ratings provide too little of it to generate or evaluate predictions properly.