Youtube Recommender Analysis & Discussion

This is to serve as a personal understanding of the features and make-up of the youtube ‘algorithm’ that is oft mentioned by content creator and consumer alike. Much of this will be paraphrasing from the paper posted, which is linked a bit below this. Feel free to read. I hope it will be helpful. All images are pulled from the document.

Note: Anything with an italics has a note at the bottom with potential new words or further understanding.

Scenario Design

Who are the target users of the recommender system? What is the goal? How can it accomplish this?

Target Users: Any and all viewers of youtube content
Pick content related to their current interest or mood (funny, knowledge, topic-related, music, etc.)
Identify their patterns, interests, demographics, routines, and habits and use those to match with content that matches factors aligning with those user characteristics

Reverse Engineering

Firstly, I’m drawing the majority of this info from the following source(s):

Deep Neural Networks for Youtube Recommendations by Paul Covington, Jay Adams, and Emre Sargin

I must say that the conclusions section of the paper is so well written in an easily understandable form that I highly recommend reading it first. I will paraphrase as best I can in this written understanding of their recommender system at the end, but again highly recommend individuals read it for themselves. I have no doubt that there will be concepts that are either mis-represented or not shown. View this as a personal understanding and takeaway discussion starter. If there are any mis-representations, feel free to comment or contact me and let me know! I’d love to learn more.

Overview of the Model

The goal of youtube’s recommendation feed is to suggest the next most likely watched video to the user. That may be a bit un-intuitive as for something to be the most likely watched, it is predicated on the fact that it is watched, but the paper gets into that. To accomplish this recommendation, there are two neural networks at play. One for generation of candidates. Another for ranking said candidates.The candidate generation network utilizes user history to populate hundreds of candidates. The candidates are meant to be generally relevant by use of collaborative filtering and are then handed off to be ranked. The ranking network assigned a score to each video by a function of features that describe the user and video. The highest ranked are then presented. This two pronged approach allows the ability to filter down from millions of videos to a curated set of ranked suggestions.

Modelling Recommendation

The problem of recommendation is, in their words, a “multiclass classification where the prediction problem becomes accurately classifying a specific video watch, W, at time, T, among millions of videos, I (classes) ,from a corpus, V, based on a user, U, and context, C

       If the context and embedding pair values are the solution, how does Youtube’s model accurately create predictions? It does this by creating an “embedding”, which can be thought of as a categorical indicator of both the user and the video, in the sense that if they’re both in the same cateogry, it’s more likely the video and user will have a higher context rating and likelihood of being a good match for a user to view the video.
       Context, C, mentioned above, is dependent on creating “embeddings” of user history/content selection and the “embeddings” of video content. These embeddings can be used to create pairs between user and video to give a base for evaluating appropriate content. Classification occurs by the neural network as functions of “the user’s history and context that are useful for discriminating among videos with a softmax classifier”, by using the embeddings described earlier to generate mathematical values to quantify by.
       The model learns how to classify via implicit feedback (watch time, clicks, subscribed channels, etc.) and NOT explicit feedback(likes, dislikes). This was due to the significantly larger amount of implicit feedback data Youtube had over explicit.

Training the Model

Now that we know how to present the problem to the model, how does one train a model to classify videos in accordance with the user? With the never ending stream of video content uploaded by users, the first task is to filter. Recall that for Youtube, it was chosen to break this problem down into generating candidates and ranking those candidates.

In general, this graphic shows that the model pulls millions of videos for candidate generation, which are filtered down by using user history and context, and is then filtered again by adding video features and accounting for other candidate sources (ie. watched and was not recommended by the recommender) to create ranks. These videos are then shown to the user in a manageable list by rank. So let’s dive into how candidate generation is executed.

Candidate Generation

       User History and Context are the main drivers for filtering in Candidate Generation and Ranking of videos from the corpus. By teaching the model how to filter videos and classify, it can then classify videos in relation to likelihood of being watched. resulting in a candidate with a higher recommendation. But how did they train the model to filter candidates correctly?
       The answer is feeding a feedforward neural net the previously mentioned video embeddings. Using a technique to sample negative classes (aka videos that received a negative value in classification due to not being clicked by a user) and then corrected by importance weighting, they could train the model to filter efficiently. The technique used can be read more about here: On Using Very Large Target Vocabulary for Neural Machine Translation.
       Commonly used approaches such as traditional softmax and hierarchical softmax were shown to be less accurate or much more time consuming. Hierarchical softmax degraded in performance due to having to transverse across unrelated classes. This caused accuracy and time issues since youtube’s classification problem had millions of classes (imagine how many different kinds of videos there are on youtube). Below is a graphic of the model used. At serving time, they noted that the scoring from the softmax layer was unnecessary, the problem would turn into a nearest neighbor search. A grahic showing the model for candidate generation is below.

For those well versed in this area, you may wonder why a matrix multiplication system was not used. The team noted that the ability to add in arbitrary features was much easier to accomplish in a deep neural net. Matrices with concrete dimensions are a pain to deal with when adding or removing features. With as many potential features that Youtube’s videos could have, it was easier to use the neural net. This became extremely important since the team needed a way to deal with certain features over-centralizing. An example of one of these features would be age of videos or topic of videos. In short, the model would naturally fill a recommendation page with videos that had the same topic or “labels” as what was just searched or viewed for by the user. One may think that this is perfect, but it’s better to think about viewers watching content in a sequential form. This leads us to the next concept used by the team. Solving a surrogate problem, rather than the actual problem to deal with these over-centralizing features.

The Surrogate Problem

Recommendation is often tied to a surrogate problem. The original problem for Youtube was to solve, “what video does the viewer want to watch?” As previously mentioned, showing a recommendation page full of Taylor Swift videos, right after the viewer just watched one, actually didn’t perform well in their testing. On the other hand, video popularity was extremely variant and the model would average popularity over several weeks. Most videos pull their Unfortunately, the model naturally filled the recommendation page up with extremely recent videos all on the most recently viewed topic. To handle this, Youtube changed their focus to solving a surrogate problem. Their surrogate problem was chosen to be “what video would the next most likely video watched by the user be?” This showed great improvement in their A/B testing. But how did they translate this into changes for the model? An example of this change would be handling video age.
Recently created videos are heavily sought after, but the model would average a video’s popularity and likelihood of being watched over a few weeks. Resolving this was done by feeding age as a feature during training. A figure below shows the impact of video candidate generation after including the age feature.

To resolve this, the team chose to randomly select video watches from the user’s history and remove that video’s labels(i.e. category of content) from being input into the model. This left only the user’s previous actions leading up to the watch, which could range over several topics and would then ensure the most recent video’s label (ie “taylor swift”) overwhelming the recommendation page. Effectively, the way I understand this, is they’re keeping their model from overfitting to criteria that are normally over-centralizing just like the concept in statistics is to reduce how outliers impact the mean as we attempt to create an understanding of things broadly. This handled overloading the feed with one recently searched category, but does not handle only viewing recently uploaded videos.

These are the basic ideas of candidate generation in Youtube’s recommender. Ranking is accomplished with another deep neural net, similar to candidate generation, that uses logistic regression to assign a rank to each video based on a set of features that is significantly larger than what was used for candidate generation. This increased set of features is possible since the number of videos has been reduced from millions to hundreds, thus allowing “room” for other data.

Ranking

       Features were in several groups. Categorical, continuous, and ordinal; with some categories being binary or having millions of options pending the user’s last search. Features were also organized by whether they contributed to a single value or a set of values. An example of a single value feature would be a video ID, where a set of values could be a list of the last 10 video ID’s viewed by the user. Features were also classified by whether they were classified properties of the item, aka impressions, or properties of user/context, aka queries. Query features were computed once per ranking request and impression features were computed for each video ranked.
       Features are engineered by human resources to convert user and video data into useful features that relate to how that video impression was scored. This is quite the challenge, but have found that the most important signals for this are the ones that describe user interaction with previous videos or related items like ads. Strong features were found to be how many videos a user has watched from that channel or the last time the topic was viewed. These also generalize well. Key features tend to be frequency viewed and last time viewed so that recommendations can keep up with viewer interests as they develop.
       Categorical features were captured by embeddings, similar to the ones described earlier, using vocabulary sets. Continuous features are all normalized before being fed into the model. These features could be univalent or multivalent, meaning they could contribute to one or more values.

Modeling Expected Watch Time

       Using the features in ranking to model expected watch time is the core of video ranking for recommendations. These rankings are calculated by weighted logistic regression, which was created for this very purpose. Positive examples where the video was watched are weighted by the amount of time they were watched for. Negative examples all receive the same weight. Odds learned by the regression come out to be : \[\frac{\sum(T_i)}{N-k}\] where N is the number of examples, k is the number of positive examples, and T is the watch time of the ith example.Ultimately, the ranking comes out to be the \(\mathbb{E}[T](1+P)\), the expectation of watch time, T, and P being the probability of being clicked on. What’s interesting to think about is that each ranked video would then likely have some probability of being clicked and that if the viewer clicks it, the model would have an expectation of how long the viewer would watch it. You can think about that the next time you’re scrolling through your recommended feed.
       Since the model was trained on both videos that were and were not clicked, and uses the expectation of watch time as a component, something that seems uncaptured by this weighting method is the predication that for a video to be watched, it must be clicked. Using an expectation of watch time and its product with the probability of being clicked to create a ranking seems like an odd method. Of course, I can’t really suggest a better way, but I thought it interesting to think about!
       On this note, there were experiments performed to capture positive examples receiving lower scores than negative examples and were called mispredicted watch time. Recall that positive means it was watched and negative means it was not. Tests were made to attempt to reduce this as a fraction over total watch time by using hidden layers in the neural net, but at the cost of processing time.

Conclusions

The conclusions paragraph in the paper is so absolutely well written I have a hard time not just pasting it here. I highly recommend anyone interested in the topic read that. Overall, my personal takeaways are that classification can be used to answer problems pending next best choice by classifying examples by their likelihood of matching the characteristics of the next best choice. Youtube used a double neural net setup to generate candidates and then to rank them, to filter videos for their users. Surrogate problems can help change the perspective in which an engineer approaches a problem that can have substantial improvement in solving the “real” problem. I really quite enjoyed seeing how their team translated the “real life issue” into what it meant for the model. An example of this is handling the video age problem by inputting age as a feature into training, even when it meant inputting some manually selected values in negatives or zero value features, to accomplish the behavior they wanted to see from the net.

Asides/Lookups/Notes:

Deep Neural Networks for Youtube Recommendations by Paul Covington, Jay Adams, Emre Sargin at Google {pcovington, jka, msargin}@google.com Written in 2016

Embeddings: Imagine these as complex labels that are derived by matching video and user content features such as searches, content labels, and other contextual variables (age, demographic, etc.) and are used to identify a match between user and video

Feed Forward Neural Net

Softmax Classifier: While a logistic regression classifier is used for binary class classification, softmax classifier is a supervised learning algorithm which is mostly used when multiple classes are involved. Softmax classifier works by assigning a probability distribution to each class

Hierarchical softmax: Effectively a hierarchical form of binary approximation where words are replaced into the in the softmax layer instead of numerical values

Softmax Layer: In the output layer of a neural net, the last set of nodes may all pop out some numerical value (we’ll say a sigmoid value meaning between 0 and 1). What if the value the node spits out is .5? This also doesn’t take into account what the other nodes in the last layer state either. To deal with this, a softmax activation function is used to incorporate the likelihood that an observation is apart of whichever class that node suggests (as opposed to other nodes) which accomodates a multi-class problem (rather than the typical sigmoid/binary classification problem).

A/B Testing: A/B testing is used in experimentation. Users are split into two groups where each receive a specific version of a variable to be tested. This allows a clear indication of better performance.

Surrogate Problem: A secondary or alternative method of viewing an original problem that accomplishes the same goal or solves the issue. For example, an original problem could be “What should I recommend this user to watch?”, while its surrogate problem is determining “How much time will this user watch this video for?” By answering the second question, we have an effective way of solving the original problem. Thus the 2nd question about watch time would be considered a surrogate problem.