DATA643 - Project 6: Summary of Bayesian Low-Rank Determinantal Point Processes

1 - Introduction

In this weeks project we will be summarizing a talk from the 10th ACM Conference on Recommender Systems (RecSys’16). We have selected chosen to select the presentation of Bayesian Low-Rank Determinantal Point Processes because this topic is brand new and we are curious about the application of Bayesian methods in Recommender Systems. The original paper can be found on the ACM Digital Library however the full paper requires a membership to access. The domain for this talk is the application of the system to an online shopping basket and making recommendations of products. This method can also applied to a number of different recommenders and, according to the authors, performs better than a number of ‘state of the art’ methods.

2 - Determinational Point Process

The determinational point process (DPP), measures the probability of all possible point processes or subsets of a ground base set. For example it would measure the probability of all possible shopping baskets over a ground base sets of all possible items in the store. According to the presentation the DPP measures the probability of getting a certain amount of diversity in a set across all of the possible subsets.

DPP was developed in the statistical physics community by Macchi in 1975 and is also used in combinatorics and random matrix theory. DPP’s are also being studied for their application to machine learning problems. The application to recommender systems is based on the following:

Every item is represented by an item trait vector.
The magnitude of the item trait vector can be thought of as the item popularity.
The angle between trait vectors is the item similarity.
The probability of getting a specific subset or basket is the volume of space that is spanned by the item trait vectors.
The model then attempts to maximize the area of all possible subsets.

2.1 - Formal Definition

For a discrete base set (item catalog) \(\mathcal{Y} = \{x_{1}, x_{2}, \dots, x_{M} \}\). A DPP defined by a \(M \times M\) positive semi-definite kernel matrix \(L\) is a probability measure on the \(2^M\) possible subsets \(Y\) in \(\mathcal{Y}\):
\[ P_{L}(Y) = \frac{det(L_Y)}{det(L+I)} \]

The diagonal of \(L_{ij}\) captures the popularity
The off-diagonal entries \(L_{ij} = L_{ji}\) measures the item similarity between \(i\) and \(j\)

2.2 - Issues with the DPP model

The traditional DPP model has a number of issues that make it concerning for large data sets:

All inference calculations are \(O^3\) so they are very computationally expensive.
The gradient ascent learning method typically fails due to the project of the gradient to a full raked DPP kernel.
Memory Consumption is quadratic to the size of the catalog.

2.3 - Proposed Low Rank Method

In the presentation the method of lowering the rank of the matrix is proposed to be as follows: \[ L = VV^T \] Where \(V\) is the item trait matrix where each row is an item trait vector and the number of item traits selected i a hyper-parameter that can be manipulated.

2.4 Baysian Method for Weighting the Rows of \(V\)

In the presentation they note that the model is very sensitive to the weighted values of each of the vectors. To improve the ability of finding these weights the authors proposed a Bayesian methodology for finding these weights:

Assign a multivariate Gaussian to each item trait vector in \(V\)
The posterior likelihood is then calculated from the conditional distribution using the original probability from the formal definition.
A log-likelihood is then performed.
The learning is then performed using a stochastic gradient Hamiltonian Monte Carlo (SGHMC)[Chen 2014] (This could be a summary all on it’s own!).
- This method is a more efficient version of a Markov Chain Monte Carlo process.
- Uses the gradient of the log-posterior to efficiently explore the posterior state space.
- Computes noisy gradients from batches of the training data like many systems in Big Data.
- Uses a friction term to ensure convergence.

3 - Real World Performance

The real world application that the authors describe is the idea of having a shopping basket filled with items A, B, and C and we would like to know a good item D to show at checkout that the user may also want to add to the cart. The recommender was tested on the Amazon Baby Registry Data Set, a Microsoft Store data set consisting of 243,000 checkout baskets and finally the Belgian Retail supermarket data set consisting of 88,000 shopping baskets and 16,000 items. The performance of the DPP method was compared to the following systems:

A full rank DPP system
A Poison Factorization system
A RecoMF system that is used to do recommendations for the XBox Live store
An Associative Classifier

Finally the system was compared to these systems using a couple of measures. The first is mean percentile rank (MPR) where a 100% indicates perfect predictions and the system performed equal to or better then the rest of the systems. The other metric used was precision and this also was competitive with the above mentioned systems.