A Reinforcement Learning Movie Recommendation System Approach using TensorFlow Agents

Luz Melo

05/01/2024

Abstract

In this project, we explore the application of reinforcement learning to movie recommendation systems using the TensorFlow Agents (TF-Agents) Bandits library. In particular, we deploy a contextual bandit algorithm building upon Google's step-by-step demo, which showcases how to build a MovieLens recommendation system using TF-Agents. Contextual bandit algorithms are well-suited for recommendation systems where the optimal action (movie recommendation) depends on contextual information (user features). Using this demo as base code, this project leverages the TF-Agents framework to implement and evaluate the performance of two algorithms: Linear Upper Confidence Bound and Linear Thompson Sampling. The primary objective of this research project is to identify optimal settings for maximizing recommendation quality while minimizing regret, which serves as a quantifiable measure of the opportunity loss resulting from the agent's decisions in comparison to an optimal policy. This project aims to provide a high-level introduction to the operation of contextual bandit algorithms and to serve as a comprehensive guide for both practical implementation and research exploration within recommendation systems.

Introduction

The MovieLens dataset is a widely used benchmark dataset in the field of recommendation systems. It contains user ratings for movies, along with contextual information such as movie genres, release years, and user demographics. The dataset was collected by the GroupLens Research Project at the University of Minnesota.

This data set consists of:

  • 100,000 ratings (1-5) from 943 users on 1682 movies.
  • Each user has rated at least 20 movies.
  • Simple demographic info for the users (age, gender, occupation, zip)

We preprocess the dataset to extract relevant features and prepare it for use in training our recommendation model. To train our RL model, we take the oldest 80% of each participant's ratings, guaranteeing that the newest 20% of the data is held out for testing; a sketch of this per-user temporal split is shown below. A visualization of the proportion of ratings in the training and testing datasets can be seen in Figure 1.
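As a rough illustration, this per-user temporal split can be sketched as below, assuming the raw ratings are loaded into a pandas DataFrame; the file path and column names follow the ML-100K format but are otherwise assumptions, not the exact preprocessing code used here.

    import pandas as pd

    # MovieLens 100K ratings: tab-separated user_id, item_id, rating, timestamp.
    ratings = pd.read_csv("ml-100k/u.data", sep="\t",
                          names=["user_id", "item_id", "rating", "timestamp"])

    def split_user(group, train_frac=0.8):
        """Oldest train_frac of a user's ratings go to training, the rest to testing."""
        group = group.sort_values("timestamp")
        cutoff = int(len(group) * train_frac)
        return group.iloc[:cutoff], group.iloc[cutoff:]

    parts = [split_user(g) for _, g in ratings.groupby("user_id")]
    train_df = pd.concat(p[0] for p in parts)
    test_df = pd.concat(p[1] for p in parts)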

This is important to keep in mind, as the recommendation algorithm might become biased toward the preferences of users who rate more frequently, since they provide more data to the system. Also, if users are more likely to rate movies they liked (leading to a higher average rating), the system may need to adjust for the fact that the absence of a rating does not necessarily indicate disinterest or dislike. In future work, we could also examine whether users from different occupations have different movie preferences, since understanding these patterns can help provide more personalized recommendations.


Data and Methods

Environment Dynamics

For this project, we defined a MovieLens-specific bandit environment. Here is a quick overview of the environment dynamics (a sketch of instantiating the environment follows the list below):

  • Action: Recommending a movie.

  • Observation/State: Vector representing a user’s preferences.

  • Reward: Expected rating that the user would give to the recommended movie.
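The TF-Agents Bandits library ships a ready-made MovieLens bandit environment that encapsulates these dynamics. A minimal sketch of instantiating it is shown below; the argument names follow Google's demo and may differ slightly between TF-Agents versions.

    from tf_agents.bandits.environments import movielens_py_environment
    from tf_agents.environments import tf_py_environment

    # Batched MovieLens bandit environment: observations are rank_k-dimensional
    # user vectors, and actions are the num_movies candidate movies.
    env = movielens_py_environment.MovieLensPyEnvironment(
        "ml-100k/u.data",   # path to the raw ratings file
        rank_k=20,          # latent dimension kept after SVD
        batch_size=8,       # users sampled per step
        num_movies=20)      # restrict the action space to the first 20 movies
    tf_env = tf_py_environment.TFPyEnvironment(env)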

In this environment, each observation is a vector whose dimension equals the number of latent features. If the dimension of the context is 20, then the observation space has a shape of (20,), indicating that every observation vector the agent receives has 20 elements. Using this, we allocate a zero-initialized matrix to store observation data for all users in the current batch, where batch_size gives the number of rows (users) and context_dim the number of columns (latent features).
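For illustration only, allocating that batch of observation vectors might look like the following sketch (the variable names are illustrative, not the environment's internals):

    import numpy as np

    batch_size = 8     # number of users sampled per environment step
    context_dim = 20   # number of latent features per user vector

    # One row per user in the batch; row i is later filled with the
    # context_dim-dimensional preference vector of user i.
    observations = np.zeros((batch_size, context_dim))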

Following this, one of the first steps in initializing this environment is to compute a data matrix, or user-movie ratings matrix, which is a structured way of representing the ratings that users have given to movies. More precisely, in the context of our MovieLens dataset, we have that (a sketch of building this matrix follows the list below):

  • Rows represent individual users.
  • Columns represent individual movies.
  • Entries in the matrix represent the rating given by a user to a movie. If a user has not rated a movie, the corresponding entry is a zero.
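Continuing the earlier split sketch, building this matrix from the training ratings could look like the following (user and movie IDs in ML-100K are 1-based, hence the -1 offsets):

    import numpy as np

    num_users, num_movies = 943, 1682

    # Dense user-movie ratings matrix: rows are users, columns are movies,
    # and zeros mark movies a user has not rated.
    data_matrix = np.zeros((num_users, num_movies))
    for row in train_df.itertuples():
        data_matrix[row.user_id - 1, row.item_id - 1] = row.rating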

Initially, the data_matrix contains ratings for 1682 movies across 943 users. However, our baseline model's data matrix will only contain ratings for the first 20 movies. That is, the dimensions of the matrix are reduced from 943 x 1682 to 943 x 20 for the base model. This operation is particularly useful since we wish to limit the complexity of the environment; it is more computationally efficient because it simplifies the problem space for the learning model. The purpose of this matrix is to perform Singular Value Decomposition (SVD), a matrix factorization technique that aids with dimensionality reduction. SVD decomposes a matrix, in this case the data matrix (or user-movie ratings matrix), into three distinct matrices:

A = u · s · vh, where:
  • u: Contains the left singular vectors (orthogonal matrix).
  • s: Diagonal matrix containing the singular values, which are non-negative and usually arranged in descending order.
  • vh: Contains the right singular vectors (transpose of v, another orthogonal matrix).

Here are two main reasons why this is important:

1. Extracting Important Features: SVD breaks down the user-movie matrix into component matrices that highlight the underlying structures in the data. By decomposing the matrix into singular vectors and singular values, SVD helps identify and separate the most influential features of users and movies. This is crucial because it allows the system to focus on significant relationships between users and movies, ignoring the noise or less critical data.

2. Reducing Complexity: In large datasets like MovieLens, with many users and movies, the full user-movie matrix can be extremely large and sparse (filled predominantly with zeros because most users have not rated most movies). SVD helps reduce the complexity of the problem by producing a lower-dimensional representation of this matrix. This compressed representation retains the most meaningful interactions (as reflected by the singular values), making computations more manageable and efficient. See linalg.svd to learn more about this function and its purpose.
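Continuing the sketch, the baseline slices the matrix down to the first 20 movies and decomposes it with NumPy's linalg.svd:

    import numpy as np

    # Restrict to the first 20 movies, as in the baseline model (943 x 20).
    data_matrix_20 = data_matrix[:, :20]

    # full_matrices=False yields u (943 x 20), s (20,), and vh (20 x 20).
    u, s, vh = np.linalg.svd(data_matrix_20, full_matrices=False)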

Next, we have rank_k, which is the number of singular values (and corresponding singular vectors) retained from the decomposition to approximate the original matrix. The value of rank_k is effectively the dimensionality to which the data is reduced: it is the number of largest singular values (and their corresponding singular vectors in u and v) retained to create a reduced approximation of the original matrix A. By choosing rank_k singular values, we are essentially saying that we believe these rank_k features capture the most critical aspects of our data. To determine what rank_k to use, we base the choice on the variance explained by the singular values. The idea is to choose rank_k such that the sum of the squares of the singular values up to rank_k accounts for a desired percentage of the total sum of squares of all singular values. This percentage is often set between 80% and 90%, depending on how much of the data's variance one wishes to capture.
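For example, picking the smallest rank_k whose singular values explain at least 90% of the variance could be sketched as:

    import numpy as np

    # Cumulative fraction of variance captured by the top-k singular values.
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)

    # Smallest k that reaches the 90% threshold (argmax finds the first True).
    rank_k = int(np.argmax(explained >= 0.90)) + 1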

Trade-off Between Complexity and Accuracy:

Selecting a rank_k involves balancing the complexity of the model (in terms of computational cost and memory usage) against the accuracy, or fidelity, of the matrix approximation. A higher rank_k means a more accurate approximation of the original matrix but requires more computational resources. In our base model, we choose rank_k = 20, which uses all possible singular values for the matrix decomposition. Essentially, this does not reduce the dimensionality of the data at all in terms of the number of singular values used. The full SVD in this case captures all the variability in the 20-dimensional item space. This is useful for the following reasons (a short verification sketch follows the list):

  • No Information Loss: By choosing rank_k = 20, we ensure that no information is lost in the factorization process.

  • Optimal Reconstruction: Using rank_k = 20 means the matrix reconstructed from the SVD components will exactly match the original data matrix. This is particularly useful for understanding the underlying structure of the data without losing any detail.
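Because rank_k = 20 keeps every singular value, the rank-k reconstruction should recover the original matrix up to floating-point error, which a quick sketch can verify:

    import numpy as np

    rank_k = 20

    # Rank-k approximation built from the top rank_k singular values/vectors.
    approx = u[:, :rank_k] @ np.diag(s[:rank_k]) @ vh[:rank_k, :]

    # With rank_k = 20 (all singular values) this matches the original matrix.
    assert np.allclose(approx, data_matrix_20)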

Agent

In this project, we define a Linear UCB agent. The LinearUCBAgent is a specific implementation of the Upper Confidence Bound (UCB) algorithm for linear contextual bandit problems within the TensorFlow Agents (TF-Agents) framework. This agent aims to solve problems where the optimal action (recommending a movie) depends on the context (user features) provided at each step. Here is how this agent interacts with our environment:

Context and Action Spaces:

  • Environment: It provides a state, or context, at each timestep. In the case of the MovieLens environment, the context is a feature vector representing user preferences, derived through SVD of the user-movie ratings data.

  • Agent: It uses these feature vectors (contexts) to decide which action/movie to take/recommend. The decision is based on maximizing the expected reward, calculated using a confidence interval around the estimated rewards for each action.

Exploration vs. Exploitation:

The LinearUCBAgent implements an exploration strategy that balances exploiting the best-known action and exploring new actions that might lead to higher rewards. The alpha parameter controls the degree of exploration. For more information about the UCB algorithm, please refer to the original paper or to Chapter 3 of “Mastering Reinforcement Learning with Python” by Enes Bilgin.
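A hedged sketch of constructing the agent against the environment's specs is shown below; the alpha and Tikhonov-weight values are illustrative, and the argument names follow the TF-Agents demo, so they may differ across library versions.

    import tensorflow as tf
    from tf_agents.bandits.agents import lin_ucb_agent

    # Linear UCB agent: a linear reward model per arm (movie), with an
    # upper-confidence bonus scaled by alpha to encourage exploration.
    agent = lin_ucb_agent.LinearUCBAgent(
        time_step_spec=tf_env.time_step_spec(),  # context (user vector) spec
        action_spec=tf_env.action_spec(),        # one arm per candidate movie
        alpha=0.25,                              # exploration strength (illustrative)
        tikhonov_weight=0.001,                   # ridge regularization (illustrative)
        dtype=tf.float32)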

Regret Metric

On a similar note, we use the regret metric, which measures the difference between the total reward obtained by the bandit agent and the total reward that would have been obtained by always choosing the optimal action at each time step.
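To make the definition concrete, here is a small NumPy sketch of per-step regret measured against the approximate ratings matrix; TF-Agents' built-in RegretMetric tracks the same quantity during training, and the variable names below are illustrative.

    import numpy as np

    def batch_regret(approx_ratings, user_indices, chosen_movies):
        """Average regret for one batch: best possible rating minus rating obtained."""
        optimal = approx_ratings[user_indices].max(axis=1)      # best achievable per user
        obtained = approx_ratings[user_indices, chosen_movies]  # rating of recommended movie
        return (optimal - obtained).mean()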

Results

To evaluate the policies, we compare the cumulative regret (CR) of the Linear Upper Confidence Bound (LinUCB) and Linear Thompson Sampling (LinTS) algorithms under different hyperparameter settings. In Table 1, each row corresponds to a specific hyperparameter variation, labeled v1 through v6. The “Baseline Parameters” row represents the initial hyperparameter values used as a reference point for comparison. Based on the results, we can observe the impact of varying each hyperparameter on the cumulative regret of both algorithms. For instance, when the batch size is reduced to 1 (highlighted in orange), the cumulative regret increases significantly for both LinUCB and LinTS compared to the baseline. This could be because the agent chooses only one action at a time, limiting its ability to explore different options effectively, which might lead to sub-optimal solutions.

On the other hand, we notice that increasing the steps per loop to 10 leads to a reduction in cumulative regret, which could easily be misinterpreted as improved performance for both algorithms. However, looking at the individual values, we see that the regret decreased abruptly in the first few steps and then alternated between high and low values. The variations in rank_k and the number of actions also affect the cumulative regret differently for LinUCB and LinTS. Decreasing rank_k to 12 results in a decrease in cumulative regret for both algorithms, while reducing the number of actions to 12 leads to a significant decrease in cumulative regret for LinTS but not for LinUCB. While I am not completely sure why this is the case, it is worth exploring in future work.

Lastly, changes in the Tikhonov weight and agent alpha hyperparameters also influenced the cumulative regret differently. Increasing the Tikhonov weight to 0.1 leads to a notable increase in cumulative regret for both algorithms, while increasing the agent alpha to 1.0 results in a decrease in cumulative regret for LinUCB but an increase for LinTS. These results highlight the sensitivity of the algorithms to their hyperparameters and how these values can impact performance in a movie recommendation system.

Conclusion

In this project, we explored the application of reinforcement learning (RL) to movie recommendation systems using the TensorFlow Agents (TF-Agents) Bandits library. Using the simulation environment, we aimed to address the challenge of recommending movies tailored to individual user preferences. The MovieLens environment served as a medium for representing users and their movie preferences through matrix factorization techniques. By generating user vectors and approximate ratings, the environment enabled us to define the RL problem of minimizing the regret metric. Through experimentation with various hyperparameters, we evaluated the performance of two algorithms: Linear Upper Confidence Bound (LinUCB) and Linear Thompson Sampling (LinTS). Although this project addresses an offline problem, these findings contribute to the understanding of contextual bandit algorithms in real-world recommendation systems. Additionally, we discussed some of the challenges that arise when trying to evaluate these systems and encourage an open-science approach to address these issues effectively.