The alternating least squares (ALS) algorithm is used throughout the recommender systems industry. Think of a user browsing Amazon, or a Netflix subscriber wanting to explore new movies: in each case the goal is to present that user with a list of items he or she may find relevant. I will focus on applying ALS to a specific type of data called implicit feedback data.
Implicit feedback data is data in which the users' preferences for items are never stated explicitly (there are no ratings); they can only be inferred from behavior recorded in the dataset. A simple example, and the one we will explore, is music listening history. Suppose we have a dataset (we do, and we will download it shortly) consisting of a list of users and, for each, the artists they played along with play counts. Concretely, say user “143567” listened to “Radiohead” 12 times; the dataset can contain multiple entries for this user, listing other artists he listened to along with their play counts. A dataset like this is implicit because all we have are users and play counts for the artists they listened to, never an explicit rating.
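To make the format concrete, here is a toy slice of what such data might look like in R (the IDs and counts below are made up purely for illustration):

# a made-up slice of implicit feedback data:
# one row per (user, artist) pair, together with a play count
listens <- data.frame(
  user_id   = c(143567, 143567, 143567, 200001, 200001),
  artist_id = c(10001, 10002, 10003, 10001, 10004),
  plays     = c(12, 3, 1, 7, 25)
)
listens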
The ALS algorithm belongs to a family of recommendation methods called Collaborative Filtering (CF). Formally, CF models a user's past activity and behavior, in our case listening counts, and then tries to recommend items based on similarity to other users. In the same field of recommender systems there exists another broad set of algorithms called Content Based Filtering; these methods rely on attributes of an item to find other, similar items to recommend. This of course requires some processing to determine such attributes, and most of the time this is proprietary knowledge that companies keep to themselves.
Within CF, the approach taken by ALS is known as a Matrix Factorization model. This type of approach tries to reveal and explain the interactions in a large dataset with just a small number of latent factors (implicit features). You can read more on the topic on the Wikipedia page for Factor Analysis.
Let us now grab the data and move on to the Matrix Factorization approach.
The dataset can be downloaded from http://bit.ly/1KiJdOR.
The data consists of three files.
user_artist_data.txt contains a list of users, each identified by a unique ID, followed by an artist ID and the number of times that user played that artist.
artist_data.txt contains the artist name associated with each artist ID used in user_artist_data.txt.
artist_alias.txt is a helper file that maps commonly misspelled artist IDs to the canonical artist ID.
If you are on a computer with a bash shell, I recommend a quick glimpse at these with
$ cat user_artist_data.txt | less
Reading the README file can help a lot as well.
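To follow along in R, the interactions can also be loaded directly; here is a minimal sketch, assuming the file sits in the working directory and is whitespace-separated as described above (the column names are my own choice):

# user_artist_data.txt is large, so this read may take a while
user_artists <- read.table(
  "user_artist_data.txt",
  col.names = c("user_id", "artist_id", "plays")
)
head(user_artists)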
The approach taken is as follows.
We will represent our data as a (very) large matrix \(A\), say of dimensions \(m \times n\). We impose no particular size on these, but the usual case is that the number of items (artists) outnumbers the number of users, so that generally \(m << n\). In our case the rows of \(A\) represent users, and the columns of \(A\) correspond to the artists. So \(A_{ij}\), the \((i,j)^{th}\) entry of the matrix, is the number of times the \(i^{th}\) user played the \(j^{th}\) artist. It goes without saying that \(A\) is sparse, since most users will have listened to only a fraction of all the artists in the data.
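In R, and assuming the user_artists data frame loaded earlier, one way to build this sparse matrix is to re-index users and artists to consecutive integers and hand the triplets to Matrix::sparseMatrix; a rough sketch:

library(Matrix)
# map the raw IDs to consecutive row and column indices
u_index <- as.integer(factor(user_artists$user_id))
a_index <- as.integer(factor(user_artists$artist_id))
# rows = users, columns = artists; repeated (user, artist) pairs are summed
A_full <- sparseMatrix(
  i = u_index,
  j = a_index,
  x = user_artists$plays
)
dim(A_full)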
The goal of Matrix Factorization models is to approximate \(A\) with two smaller matrices, \(U\) and \(V\), representing the rows and columns of \(A\) respectively (users and artists). Formally, \(U\) will be a matrix whose columns are the users, each described in \(k\) dimensions. Similarly, \(V\) will be the matrix whose columns are the artists, also described in \(k\) dimensions.
\[U_{k\times m} = \begin{bmatrix} \uparrow &\uparrow &\cdots &\uparrow\\ u_1 &u_2 &\cdots &u_m \\ \downarrow &\downarrow &\cdots &\downarrow \end{bmatrix}\;\; V_{k\times n} = \begin{bmatrix} \uparrow &\uparrow &\cdots &\uparrow\\ v_1 &v_2 &\cdots &v_n \\ \downarrow &\downarrow &\cdots &\downarrow \end{bmatrix} \]
The vectors \(u_i\) and \(v_j\) are called factors (sometimes latent factors); moreover \(k\) is usually picked to be small, and certainly \(k << m,n\). Here \(k\) represents the features we believe associate users with artists, and so the entries of the column vectors of \(U\) and \(V\) express how much association each user or artist has with the \(k\) features.
We then predict the “rating” of user \(u_i\) for artist \(v_j\) to be \[r_{ij} = u_i^Tv_j\]
And so our aim is to approximate (and complete) \(A\) as follows: \[A \approx U^TV\]
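As a quick numeric illustration of these two formulas (random factors, not yet fitted to any data), we can build a small \(U\) and \(V\) and compute predictions in R:

# toy dimensions: 4 users, 6 artists, k = 2 latent features
k <- 2; m <- 4; n <- 6
U <- matrix(rnorm(k * m), nrow = k, ncol = m)  # users as columns
V <- matrix(rnorm(k * n), nrow = k, ncol = n)  # artists as columns

# predicted "rating" of user 2 for artist 3: r_23 = u_2' v_3
r_23 <- sum(U[, 2] * V[, 3])

# the full reconstruction A ~ U'V in one shot
A_hat <- crossprod(U, V)   # same as t(U) %*% V, an m x n matrix
all.equal(r_23, A_hat[2, 3])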
Our attention now switches to finding the matrices \(U\) and \(V\) that best approximate \(A\).
As mentioned above, we seek to express \(A\) as a product of \(U\) and \(V\), each representing users and artists in a \(k\)-dimensional feature space respectively. Unfortunately this cannot be done directly, and for this reason we introduce the Least Squares algorithm.
Let us try out the ideas developed so far with a simple example.
library(Matrix)
# to create a sparse matrix we give it i, j index where to
# fill in the non-zero elements
a_i <- sample(1:4, size = 40, replace = TRUE)
a_j <- sample(1:40, size = 40, replace = TRUE)
# the non-zero elements
x_fill <- sample(1:10, 40, replace = TRUE)
A <- sparseMatrix(a_i, a_j, x = x_fill)
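# note: no seed was set, so your matrix will differ from the one shown below;
# repeated (i, j) pairs are summed by sparseMatrix, which is why some entries
# exceed 10, and the number of columns is the largest j actually sampled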
print(A)
## 4 x 35 sparse Matrix of class "dgCMatrix"
##
## [1,] . . 9 . 5 2 . . . . . . . . 8 . . . . 13 . 2 8 1 . . . . . . .
## [2,] . . . . 8 10 . . . . . 10 . 9 10 6 . . . . . . . . . . . . 6 . .
## [3,] 2 . . 10 . . . . . . 7 . . . . 9 2 . 5 . . 11 . . . . . 2 10 . .
## [4,] . . 6 . 8 . . 5 5 . . 10 . . . . . . . 7 . . . . 9 . . . 7 . .
##
## [1,] . . . .
## [2,] . . . 17
## [3,] . . . .
## [4,] 12 . . .
Note: We say that we are completing \(A\) since we start with a sparse matrix \(A\), but the product \(U^TV\) will yield a dense matrix, with nonzero entries even in positions where \(A\) originally had none.
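To give a feel for where this is heading, here is a minimal sketch of how alternating least squares could fit \(U\) and \(V\) to the toy matrix A built above. It assumes a plain (unweighted) squared-error loss with a small ridge penalty lambda (both choices are mine, purely for illustration), and it leaves out the implicit-feedback weighting of the counts; it is a sketch, not the final method.

k <- 2         # number of latent factors
lambda <- 0.1  # ridge regularization, chosen arbitrarily here
m <- nrow(A); n <- ncol(A)
Ad <- as.matrix(A)   # dense copy of the toy matrix, fine at this size

# random starting factors, users and artists as columns
U <- matrix(rnorm(k * m), k, m)
V <- matrix(rnorm(k * n), k, n)

for (iter in 1:10) {
  # fix V, solve a least squares problem for the user factors U
  U <- solve(V %*% t(V) + lambda * diag(k), V %*% t(Ad))
  # fix U, solve a least squares problem for the artist factors V
  V <- solve(U %*% t(U) + lambda * diag(k), U %*% Ad)
}

# reconstruction error of the approximation A ~ U'V
norm(Ad - crossprod(U, V), "F")

Each update holds one matrix fixed and solves an ordinary least squares problem for the other, which is exactly the alternation that gives the algorithm its name; the implicit-feedback variant weights each squared error by a confidence built from the play counts, but the alternating structure stays the same.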