From the video, I found the Matrix Factorization concept the most interesting for understanding the fundamental mathematical underpinnings of recommendation approaches. I did some research to find the simplest way to explain the concept.

Matrix factorization for recommender systems

Basic matrix factorization assumptions:

Each user can be described by k attributes or features. For example, feature 1 might be a number that says how much each user likes sci-fi movies.

Each item (movie) can be described by an analogous set of k attributes or features. To correspond to the above example, feature 1 for the movie might be a number that says how close the movie is to pure sci-fi.

If we multiply each feature of the user by the corresponding feature of the movie and add everything together, this will be a good approximation for the rating the user would give that movie (see the numerical sketch below).

The beauty is that we do not know what these features are. Nor do we know how many (k) features are relevant. We simply pick a number for k and learn the relevant values for all the features for all the users and items. How do we learn? By minimizing a loss function.
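
As a minimal numerical sketch of this multiply-and-sum idea (the k = 3 feature values below are made up purely for illustration):

```python
import numpy as np

# Hypothetical k = 3 latent feature vectors (all values invented for illustration).
# Feature 1 could be "how much the user likes sci-fi" / "how sci-fi the movie is".
user_features = np.array([2.0, 0.5, 1.0])   # user u
movie_features = np.array([1.5, 1.0, 0.5])  # item i

# Multiply each user feature by the corresponding movie feature and add everything
# together -- i.e. take the dot product -- to get the predicted rating.
predicted_rating = np.dot(user_features, movie_features)
print(predicted_rating)  # 2.0*1.5 + 0.5*1.0 + 1.0*0.5 = 4.0
```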

We can turn our matrix factorization approximation of a k-attribute user into math by letting a user \(u\) take the form of a k-dimensional vector \(x_u\). Similarly, an item \(i\) can be represented by a k-dimensional vector \(y_i\). User \(u\)'s predicted rating for item \(i\) is then just the dot product of their two vectors.

\[ \hat{r}_{ui} = x^{T}_{u} \cdot y_{i} = \sum\limits_{k} x_{uk} y_{ki}\]

where \(\hat{r}_{ui}\) represents our prediction for the true rating \(r_{ui}\), and \(y_i\) (\(x^T_u\)) is assumed to be a column (row) vector. These user and item vectors are often called latent vectors or low-dimensional embeddings in the literature, and the k attributes are often called the latent factors. We will choose to minimize the square of the difference between all ratings in our dataset \(S\) and our predictions. This produces a loss function of the form

\[ L = \sum\limits_{u,i \in S} (r_{ui} - x^{T}_{u} \cdot y_{i})^2 + \lambda_x \sum\limits_{u} ||x_{u}||^2 + \lambda_y \sum\limits_{i} ||y_{i}||^2\]

Note that we’ve added two L2 regularization terms at the end to prevent overfitting of the user and item vectors. Our goal now is to minimize this loss function. Derivatives are an obvious tool for minimizing functions, and the two most popular derivative-based methods are Alternating Least Squares (ALS) and Stochastic Gradient Descent (SGD).
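
As a rough sketch of how the SGD route could look in code (the toy ratings, hyperparameters, and initialization below are all assumptions made for illustration, not a reference implementation):

```python
import numpy as np

# Toy problem size and made-up hyperparameters (assumptions for illustration).
n_users, n_items, k = 4, 5, 3
lambda_x = lambda_y = 0.1    # L2 regularization strengths
lr = 0.01                    # SGD learning rate

rng = np.random.default_rng(0)
X = rng.normal(scale=0.1, size=(n_users, k))   # user latent vectors x_u (rows)
Y = rng.normal(scale=0.1, size=(n_items, k))   # item latent vectors y_i (rows)

# Observed ratings S as (user, item, rating) triples -- made-up data.
S = [(0, 1, 4.0), (0, 3, 2.0), (1, 0, 5.0), (2, 2, 3.0), (3, 4, 1.0)]

def loss(X, Y, S):
    """Squared error over the observed ratings plus the two L2 penalty terms."""
    err = sum((r - X[u] @ Y[i]) ** 2 for u, i, r in S)
    return err + lambda_x * np.sum(X ** 2) + lambda_y * np.sum(Y ** 2)

# SGD: for each observed rating, nudge x_u and y_i along the negative gradient
# of that rating's (regularized) squared error.
for epoch in range(50):
    for u, i, r in S:
        e = r - X[u] @ Y[i]                      # prediction error
        x_u, y_i = X[u].copy(), Y[i].copy()      # old values, used in both updates
        X[u] += lr * (e * y_i - lambda_x * x_u)  # step for the user vector
        Y[i] += lr * (e * x_u - lambda_y * y_i)  # step for the item vector

print(loss(X, Y, S))  # should be lower than the loss at the random initialization
```

ALS takes the other route: it fixes all the item vectors and solves a regularized least-squares problem for each user vector in closed form, then swaps roles and alternates until the loss stops improving.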