This document contains course notes from the Coursera course Introduction to Recommender Systems by Joseph A. Konstan and Michael D. Ekstrand

Taxonomy of Recommendation Systems

A recommender system is analyzed based on the following 8 dimensions:

  1. Domain:
    • Content to Commerce and beyond
    • New items (e.g. movies, books) vs. re-recommending old ones (e.g. groceries, music)
  2. Purpose
    • Recommendation is an end in itself (sales, information)
    • Education of user/customer
    • Build a community of users/customers around products or content
  3. Recommendation Context
    • What is the user doing at the time of recommendation? (shopping, listening to music, hanging out with other people)
    • How does context constrain the recommender? (Groups, automatic consumption vs suggestion, level of attention, level of interruption)
  4. Whose Opinions
    • Experts
    • PHOAKS (People Helping One Another Know Stuff)
    • People like you
  5. Personalization Level
    • Generic (non-personalized)
    • Demographic (matches a target group)
    • Ephemeral (matches current activity)
    • Persistent (matches long term interests)
  6. Privacy and Trustworthiness
    • Who knows what about me?
    • Is the recommendation honest?
      • biases built-in by operator (“business rules”)
      • vulnerability to external manipulation
      • transparency of recommenders
  7. Interfaces
    • Types of Output (predictions, recommendations, filtering)
    • Types of Input (explicit, implicit)
  8. Recommendation Algorithms
    • Three basic concepts that every recommender needs:
      • Users (may have user attributes and a user model)
      • Items (may have item attributes)
      • Ratings (where users meet items, e.g. ratings, purchases)
      • (Community)

Recommendation Algorithms

Types of Recommendations

  • Summary Stats:
    • Non-personalized (best seller, most popular, trending)
    • Summary of community ratings (best liked, e.g. TripAdvisor)
  • Content-based filtering
    • User ratings x item attributes => Model
  • Personalized collaborative filtering
    • User-user CF: select a neighborhood of people with similar taste and use their opinions
    • Item-item CF: pre-compute similarity among items using the ratings. Use your ratings of items to triangulate on recommendations
    • Dimensionality reduction - tastes yield a lower-dimensional matrix. Compress it and use a taste representation

Introduction to Non-Personalized Recommenders

Preferences and Ratings

Preference Model

Preferences can be classified into two groups

  • Explicit ratings e.g. ratings, reviews, votes
  • Implicit ratings e.g. click, purchase, follow

Explicit Ratings

  • Ask users what they think.
  • Common way to get this is the star rating (e.g. Amazon, Goodreads )
  • Design questions:
    • 5? 7? 10?
    • half-stars? (research shows customers like them, but they do not add incremental value to recommendations)
  • Thumbs and Likes
  • Vote up/down
  • Common with ephemeral items that don’t have a long time-span
    • News aggregators (e.g. Reddit, Digg)
    • Q&A (e.g. Stack Overflow)
    • YouTube
  • Very low cost to rate
  • Other interfaces
    • Continuous scale
    • Pairwise preference
    • Hybrid (e.g. 1-100, never again)
    • Temporary (e.g. Pandora 30-day suspend)
  • When are ratings provided?
    • Consumption - during or immediately after experiencing an item, so the rating doesn’t suffer from memory effects
    • Memory - some time after consumption has happened
    • Expectation - the user has not consumed the item yet (e.g. houses, cars)
  • Difficulties with ratings
    • Are ratings reliable and accurate? Movie ratings may change over time for the same user
    • User preferences may change over time
    • What does a rating mean? Different users could use the rating system to signify different things

Implicit Data

  • Data collected from user actions - based on their preference, but without explicitly stating preference
  • Time spent as a proxy for rating
    • How long the user spent reading is a proxy for how much they liked it
    • Listening and watching behavior similarly indicates how much users liked an item
  • Binary actions
    • Click on a link (ad, result, cross-reference)
    • Don’t click on a link
    • Purchase
    • Follow a person/topic; Friend
  • Subtleties/Difficulties
    • What does the action mean?
      • They may purchase but still hate the item
      • Don’t click: did they expect it to be bad, or did they just not see it?
    • How do you scale and represent actions, and how do you combine multiple actions?
    • Lots of opportunity to be creepy
      • Users don’t know what can be done with the data and may be creeped out
      • Education may help - telling users how you came up with recommendations

Predictions and Recommendations

  • Predictions are estimates of how much you will like an item
  • Recommendations are suggestions for items you might like
  • The balance between organic presentation (e.g. presenting 5 results, or showing user ratings) and explicit presentation (e.g. suggesting a top 5, or stating your predicted rating) is something you have to weigh as you design recommendation systems

Scales and Normalization

  • Sparsity, inconsistency and time duration make data messy
  • Ranking considerations
    • Confidence - how confident are you that this item is good?
    • Risk tolerance - high risk, high reward OR conservative recommendations
    • Domain and business considerations
  • Damped means (for items with few ratings)
    • Assume that without evidence, everything is average (\(\mu\))
    • Below, k controls the strength of the evidence: \[ \frac{\sum_u r_{u,i} + k\mu}{n+k}\] where \(r_{u,i}\) is the rating of item \(i\) by user \(u\) and \(n\) is the number of ratings for the item
  • Statistical confidence intervals (Reddit uses the Wilson interval for binomial to rank comments)
  • Domain consideration: time (old articles/ratings may not be relevant)
  • Hacker News scores items as: \[ \frac{(U-D-1)^\alpha}{(t_{now}-t_{post})^\gamma}P \] where \(U-D-1\) is the net upvotes minus one, \(\alpha\) is a polynomial dampening exponent set to 0.8 to reduce the effect of additional votes on highly-voted items, the denominator is the age of the post in hours raised to a decay exponent \(\gamma\) of 1.8, and \(P\) is a penalty term (e.g. a penalty for polls or controversial topics)
  • Reddit uses the formula (as of 2010): \[\log_{10}\max(1,|U-D|) + \frac{\operatorname{sign}(U-D)\,t_{post}}{45000}\]
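As a concrete illustration of the damped mean and the 2010-era Reddit score above, here is a minimal Python sketch; the example values and the choice of k are arbitrary:

```python
import math

def damped_mean(ratings, global_mean, k=5):
    """Damped mean: with little evidence, shrink the item's average toward the global mean."""
    return (sum(ratings) + k * global_mean) / (len(ratings) + k)

def reddit_hot(upvotes, downvotes, t_post):
    """2010-era Reddit 'hot' score: order of magnitude of the net votes plus a recency bonus.
    t_post is the posting time in seconds (measured from some fixed epoch; an assumption here)."""
    net = upvotes - downvotes
    sign = (net > 0) - (net < 0)
    return math.log10(max(1, abs(net))) + sign * t_post / 45000

# Two 5-star ratings barely move the damped estimate away from the global mean of 3.5.
print(damped_mean([5, 5], global_mean=3.5, k=5))
print(reddit_hot(120, 30, t_post=1_300_000_000))
```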

Content-based Recommenders

Content is a set of descriptors that describe the things that we are recommending

Key Ideas

How to build a profile?

How to build a preference?

How to use preferences?

Case-based recommendation

  • A technique from AI
  • Structure a database of cases around a set of relevant attributes (e.g. camera price, zoom, pixels)
  • Query based on an example or attribute, and retrieve relevant cases

Case-based approaches are most helpful for ephemerally-personalized experiences (e.g. shopping - suggest similar other items). They are often easier to explain to people

Challenges with Content-based techniques

  • Depends on well-structured attributes that align with preferences (consider paintings)
  • Depends on having a reasonable distribution of items across attributes and attributes across items
  • Unlikely to find surprising connections (e.g. chili peppers and lemon with chocolates)
  • Harder to find complements than substitutes

Takeaways

  • Many ways to recommend based on content (product attributes)
  • Long term - build profile of content preferences
  • Short term - build database of cases; navigate
  • Work without a large set of users (but need item data)
  • Good at finding substitutes; good at helping navigate toward a purchase; good explainability

TF-IDF

Motivation for TF-IDF comes from search

TF-IDF weighting

  • Term Frequency x Inverse Document Frequency
  • Term frequency = number of occurrences of a term in the document (can be a simple count)
  • Inverse document frequency = how few documents contain the term
    • typically log(#documents/#documents with term)
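A toy Python sketch of the weighting just described, using raw counts for TF and log(#documents / #documents with term) for IDF; the documents are made up:

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF per document: term frequency (raw count) times log(#docs / #docs containing the term)."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    df = Counter()
    for c in counts:
        df.update(c.keys())
    n = len(docs)
    return [{term: tf * math.log(n / df[term]) for term, tf in c.items()} for c in counts]

docs = ["space action space adventure", "romantic comedy in space", "courtroom drama"]
for weights in tf_idf(docs):
    print(weights)   # common terms like 'space' are demoted; distinctive terms are promoted
```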

What does TF-IDF do?

  • Automatically demotes common terms
  • Promotes core terms over incidental ones
  • If the core term is rarely used in the document, TF-IDF will do badly
  • Poor search terms result in bad responses

How does TF-IDF apply to CBF?

  • TF-IDF creates a profile of a document (e.g. movie described as a weighted vector of its tags)
  • These are combined with ratings to create user profiles, and then matched to future documents

Deep dive into Content-based filtering

  • Vector Space Models
  • The prediction is done as follows:
    • Calculate the cosine of the angle between two vectors (profile, item), where each is represented in an n-dimensional keyword space
    • The cosine ranges from -1 to 1 (or 0 to 1 for all-positive vectors)
    • You can use the top-n choices, or scale for rating-scale predictions
  • A key challenge is figuring out the right weights and factors. Is more really more, or are we merely re-iterating ratings?
  • Limitation - it is a simple model and cannot handle interdependencies (you may like violent comedies and historical documentaries, but not violent documentaries)

Summary

  • Content-based filtering is based on assessing the content profile for each item, and can be done from meta-data or user-tagging
  • User profiles are built by aggregating profiles of items rated/consumed, possibly with a weighting scheme
  • Evaluate unrated items by matching item profile with user profile - vector cosine
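A minimal Python sketch of that pipeline: item profiles as tag vectors, a user profile built by summing rated item profiles (here weighted by rating minus 3, one possible scheme), and unrated items scored by cosine. The items and tags are made up:

```python
import math

def cosine(a, b):
    """Cosine between two sparse vectors represented as dicts."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Item profiles: tag -> weight (these could come from TF-IDF over tags or metadata).
items = {
    "movie_a": {"action": 1.0, "space": 0.8},
    "movie_b": {"romance": 1.0, "comedy": 0.7},
    "movie_c": {"action": 0.9, "comedy": 0.5},
}

# User profile: sum of rated item profiles, weighted by (rating - 3) so disliked items
# push the profile away from their tags (one possible weighting scheme, not the only one).
ratings = {"movie_a": 5, "movie_b": 2}
profile = {}
for item, r in ratings.items():
    for tag, w in items[item].items():
        profile[tag] = profile.get(tag, 0.0) + (r - 3) * w

# Evaluate an unrated item by matching its profile against the user profile.
print(cosine(profile, items["movie_c"]))
```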

User-User Collaborative Filtering

Core Assumptions

Our past agreements predict our future agreements

  1. Our tastes are either individually stable or move in sync with each other
  2. Our system is scoped within a domain of agreement

Predicted rating of user a for item i can be given by the average of the ratings of item i by all other users (or alternately, users within a neighborhood) \[ P_{a,i}= \frac{\sum_{u=1}^n{r_{u,i}}}{n} \]

From the above, we can further personalize the recommendations as follows:

\[ P_{a,i}= \frac{\sum_{u=1}^n{r_{u,i}\cdot w_{a,u}}}{\sum_{u=1}^n w_{a,u}} \]

where the weight \(w_{a,u}\) is a measure of the similarity of users \(a\) and \(u\)

An alternate way is to normalize the ratings for each user

\[ P_{a,i}= \bar{r}_a + \frac{\sum_{u=1}^n({r_{u,i}}-\bar{r}_u)}{n} \]

where \(\bar{r}_a\) is the average rating for user \(a\), and \(\bar{r}_u\) is the average rating for user \(u\).

This scales the rating for users based on how the rating for a given item compared to the average.

The above two approaches can be combined to give a general prediction formula

\[ P_{a,i}= \bar{r}_a + \frac{\sum_{u=1}^n({r_{u,i}}-\bar{r}_u)\cdot w_{a,u}}{\sum_{u=1}^n w_{a,u}} \]

In the above, each neighbor's rating of the item is expressed relative to that neighbor's average over all items they have rated, and is weighted by how similar that neighbor is to the user for whom we are making the prediction.

One of the caveats with the above is that the predicted rating can be outside of the range of the actual allowed rating (e.g. 1 to 5 star). This can be addressed by capping ratings below the lower bound and above the upper bound to the bound values.

An example of the weighting function is the Pearson’s correlation:

\[w_{a,u} = \frac{\sum_{i=1}^m(r_{a,i}-\bar{r}_a)(r_{u,i}-\bar{r}_u)}{\sigma_a\sigma_u}\]

The sum above is taken over the items the two users have rated in common. This raises two issues:

  • What if the number of items we have in common is very small?
  • What if the ratings are unary rather than on a 5-star scale?
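A rough Python sketch of the general prediction formula above, using Pearson similarity over co-rated items and mean-centered neighbor ratings (toy data; no neighborhood selection or significance weighting):

```python
import math

ratings = {  # user -> {item: rating}
    "alice": {"i1": 5, "i2": 3, "i3": 4},
    "bob":   {"i1": 4, "i2": 2, "i3": 5, "i4": 4},
    "carol": {"i1": 2, "i2": 5, "i4": 1},
}

def mean(u):
    r = ratings[u]
    return sum(r.values()) / len(r)

def pearson(a, u):
    """Pearson correlation computed over the items both users have rated."""
    common = set(ratings[a]) & set(ratings[u])
    if len(common) < 2:
        return 0.0
    ma, mu = mean(a), mean(u)
    num = sum((ratings[a][i] - ma) * (ratings[u][i] - mu) for i in common)
    da = math.sqrt(sum((ratings[a][i] - ma) ** 2 for i in common))
    du = math.sqrt(sum((ratings[u][i] - mu) ** 2 for i in common))
    return num / (da * du) if da and du else 0.0

def predict(a, item):
    """Weighted average of mean-centered neighbor ratings, added back to user a's mean.
    The result may fall outside the rating scale; cap it if needed."""
    num = den = 0.0
    for u in ratings:
        if u == a or item not in ratings[u]:
            continue
        w = pearson(a, u)
        num += (ratings[u][item] - mean(u)) * w
        den += abs(w)   # absolute weights so negative similarities don't flip the sign
    return mean(a) + num / den if den else mean(a)

print(predict("alice", "i4"))
```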

Selecting neighbors

You can select:

  1. All neighbors
  2. Threshold or similarity distance
  3. Random neighbors
  4. Top-N neighbors by similarity or distance

How many neighbors do you use?

  1. In principle, more neighbors provide more information, but dissimilar neighbors add noise.
  2. Between 25 and 100 neighbors is generally used

With fewer users, you get less noise, but also less coverage

How do you score items from neighborhood?

  1. average
  2. weighted average
  3. multiple linear regression

Weighted average is widely used and works pretty well; also well grounded in utility theory

What’s wrong with data?

  • Users rate differently
  • Some rate low, others high
  • Some use more of the scale than others
  • Averaging ignores these differences
  • Normalization compensates for them

In the formula above, we can alternately calculate the z-score for a user rating.

\[z_{u,i} = \frac{r_{u,i}-\bar{r}_u}{\sigma_u}\]

This normalizes the rating based on not just how much the rating deviates from the average, but also how much of a range the user uses.

The prediction formula then becomes

\[ P_{a,i}= \frac{\sum_{u=1}^n{z_{u,i}\cdot w_{a,u}}}{\sum_{u=1}^n w_{a,u}}\cdot\sigma_a + \bar{r}_a \]

If there are neighbors with negative similarities, you can either use the absolute similarity in the denominator, or simply exclude neighbors with negative similarities.

An alternate way of thinking about ratings is as follows:

\[\mu + \hat{\mu}_i + \hat{\mu}_u\]

where \(\mu\) is the global average rating, \(\hat{\mu}_i\) is the item offset, i.e. the average deviation of the item's ratings from the global average \[ \hat{\mu}_i = \frac{\sum_u(r_{u,i} - \mu)}{\text{# of users who rated item } i}\] and \(\hat{\mu}_u\) is the user offset, defined as the user's average offset from the item baselines \[\hat{\mu}_u = \frac{\sum_i(r_{u,i} - \mu - \hat{\mu}_i)}{\text{# of items rated by user } u}\]

This is not particularly useful for user-user collaborative filtering, but will be useful in other algorithms later on.

The weighting function, as discussed earlier, is given by Pearson’s correlation. \[w_{a,u} = \frac{\sum_{i=1}^m(r_{a,i}-\bar{r}_a)(r_{u,i}-\bar{r}_u)}{\sqrt{\sum_{i=1}^m(r_{a,i}-\bar{r}_a)^2}\sqrt{\sum_{i=1}^m(r_{u,i}-\bar{r}_u)^2}}\]

If two users have only one item in common, there is not much to compare on. One way to counteract this is significance weighting: multiply the similarity by \(\min(|R_a \cap R_u|, 50)/50\). So if two users have 25 items in common, the similarity is cut in half compared to users that have 50 or more items in common.

Another way is to introduce a constant \(K\) that weighs down the similarity until the users exceed a certain number of common items: \[w_{a,u} = \frac{\sum_{i=1}^m(r_{a,i}-\bar{r}_a)(r_{u,i}-\bar{r}_u)}{\sqrt{\sum_{i=1}^m(r_{a,i}-\bar{r}_a)^2}\sqrt{\sum_{i=1}^m(r_{u,i}-\bar{r}_u)^2 + K}}\]

Another similarity measure is to use Spearman’s rank correlation. In practice, it doesn’t work as well

Another approach is vector cosine, where each user's ratings are treated as a vector, and we calculate the cosine of the angle between two users' rating vectors.

\[sim(a,u) = \frac{\vec{r}_a\cdot\vec{r}_u}{\|\vec{r}_a\|\,\|\vec{r}_u\|}\] \[sim(a,u)=\frac{\sum_i r_{a,i}\,r_{u,i}}{\sqrt{\sum_i r_{a,i}^2}\sqrt{\sum_i r_{u,i}^2}}\]

If we normalize the user ratings by subtracting the user’s average rating, then the above is equivalent to Pearson’s correlation. There is one key difference, though. The above formula takes a 0 for all cases where the user has not rated, and so the formula is scaled down when users don’t have enough items in common

Some of the early literature showed Pearson’s correlation doing better than vector cosine with raw ratings. However some recent literature shows that vector cosine with normalized rating does as well or better than Pearson’s correlation with significance weighting.

Additional things you can try out:

  1. Cluster users - doesn’t work as well when done naively
  2. Pre-computing user similarity offline - this may not capture user activity that moves them with relation to others

Evaluation

In the early days:

In business:

4 Themes:

Basic Accuracy Metrics

Mean Absolute Error (MAE)

An error is the divergence of the prediction from the actual opinion (rating)

Absolute error: \(|P-R|\)

MAE \[\frac{\sum_{ratings}|P-R|}{\text{# ratings}}\]

Mean Squared Error (MSE)

MSE penalizes large errors more than small ones

\[\frac{\sum_{ratings}(P-R)^2}{\text{# ratings}}\]

Squared error is not on an intuitive scale

Root Mean Square Error (RMSE)

\[\sqrt{\frac{\sum_{ratings}(P-R)^2}{\text{# ratings}}}\]
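A small Python sketch of MAE and RMSE over (prediction, actual rating) pairs; the numbers are made up:

```python
import math

def mae(pairs):
    """Mean absolute error over (prediction, actual rating) pairs."""
    return sum(abs(p - r) for p, r in pairs) / len(pairs)

def rmse(pairs):
    """Root mean squared error: penalizes large errors more heavily than MAE."""
    return math.sqrt(sum((p - r) ** 2 for p, r in pairs) / len(pairs))

pairs = [(4.5, 5), (3.0, 2), (2.5, 4)]
print(mae(pairs), rmse(pairs))
```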

Summation: We can either average across all ratings, or average over per-user averages. The latter weighs each user equally irrespective of their number of ratings. Ideally you want to calculate both. If MAE across ratings is much lower than MAE by user, then you are doing better for people with lots of ratings than for those with fewer ratings. If they are similar, then you are doing a balanced job.

When comparing different algorithms:

  • You must use the same data and scale
  • If coverage is different
    • Check against common subset
    • Supplement algorithm with default for full coverage

Error metrics are highly correlated with each other. One of the big drawbacks of error metrics is that the error can be dominated by irrelevant parts of the item space.

Decision Support Metrics

Decision Support measures how well a recommender helps users make good decisions. Good decisions are about choosing ‘good’ items and avoiding ‘bad’ ones.

For predictions, 4 stars vs 2.5 stars is worse than 2.5 stars vs 1 star.

For recommendations, top of list is what matters most.

An error, in this setting, is an ad-hoc measure of wrong predictions; for ratings, it involves classifying them into ‘good’ and ‘bad’ and counting predictions that land on the wrong side.

Reversals are large mistakes, e.g. being off by 3 points on a 5-point scale. The intuition is that such large errors are particularly damaging.

Information Retrieval Metrics

Precision is the percentage of selected items that are relevant (i.e. how effectively the filter eliminates irrelevant items)

\[P = \frac{N_{rs}}{N_s} \]

Recall is the percentage of relevant items that are selected (i.e. how well the filter avoids eliminating relevant items)

\[R = \frac{N_{rs}}{N_r}\]

Precision is about making sure what gets recommended is mostly useful. With a high-precision filter, the user does not waste time looking at non-relevant items. The assumption with precision is that there are more relevant items than you care to see, so you mostly care about returning few irrelevant items.

Recall is about not missing something that is relevant. The assumption with recall is that you have the time to filter through the items to find the one particular item you need.

F1 metric tries to find a balance between the above two.

\[ F1 = \frac{2PR}{P+R} \]

Problems with precision and recall:

Problem #1:

  • Need ground truth for all items (but if we have ground truth, why bother with a recommender)
    • Addressed with ‘fake’ precision/recall computed only over rated items, or with human experiments that compute precision/recall over some random subset

Problem #2:

  • Targeted at the entire dataset - not at the top recommended items. Precision and recall are inherently about the full query
  • Resolved through P@n and R@n - Precision@n is the fraction of the top-n items that are good, i.e. \(P@n = \frac{\text{# relevant items in the top } n}{n}\)
  • Recall@n is computed analogously, as the fraction of all relevant items that appear in the top n
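A small Python sketch of Precision@n and Recall@n as just described (the ranked list and relevance judgments are made up):

```python
def precision_at_n(ranked, relevant, n):
    """Fraction of the top-n recommended items that are relevant."""
    top = ranked[:n]
    return sum(1 for item in top if item in relevant) / n

def recall_at_n(ranked, relevant, n):
    """Fraction of all relevant items that appear in the top n."""
    top = ranked[:n]
    return sum(1 for item in top if item in relevant) / len(relevant)

ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "f"}
print(precision_at_n(ranked, relevant, 3), recall_at_n(ranked, relevant, 3))
```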

Mean Average Precision (MAP)

  • Hybrid between looking at rank vs decision support
  • Calculate precision at each rank for a query, and average them over multiple queries
  • Equivalent to calculating the area under the precision-recall curve and averaging it over multiple queries
  • Not standardized, and hence advised against using it

\[ MAP = \frac{\sum_{q=1}^{Q}AveP(q)}{Q} \]

where

\[AveP = \frac{\sum_{k=1}^n P(k)\cdot rel(k)}{\text{# relevant docs}}\]

Receiver Operating Characteristics (ROC)

  • True positive rate (y-axis) = proportion of items that should have been recommended that are
  • False positive rate (x-axis) = proportion of items that should not have been recommended but are

Summary

  • All of the above metrics are generally correlated with each other
  • Precision@n and overall precision are the most widely used (and easily understood)
  • ROC provides insight if the goal is to tune the metric and find a ‘sweet spot’
  • None of these metrics overcome the problem of being based on rated items only (and the inherent variation that comes from this limitation)

Rank Metrics

Overview: Prediction accuracy answers the question of how well the recommender does at estimating preference.

Decision support metrics answer how well the recommender does at finding good things.

Rank accuracy metrics look at how well the recommender estimates relative preference.

Metric families

  • Mean reciprocal rank
  • Spearman rank correlation
  • Discounted Cumulative Gain
  • Fraction of Concordant Pairs

Mean Reciprocal Rank (MRR)

Reciprocal rank (RR): 1/i where i is the rank of the first ‘good’ item

RR measures how far you have to go to find something good. MRR is an average over all test queries
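A quick Python sketch of MRR over a few test users (the ranked lists and relevance sets are made up):

```python
def mean_reciprocal_rank(ranked_lists, relevant):
    """Average of 1/rank of the first relevant item across test queries (users)."""
    total = 0.0
    for user, ranked in ranked_lists.items():
        rr = 0.0
        for rank, item in enumerate(ranked, start=1):
            if item in relevant[user]:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(ranked_lists)

ranked_lists = {"alice": ["i3", "i1", "i2"], "bob": ["i2", "i4"]}
relevant = {"alice": {"i1"}, "bob": {"i4"}}
print(mean_reciprocal_rank(ranked_lists, relevant))   # (1/2 + 1/2) / 2 = 0.5
```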

Spearman Rank Correlation

This measures the agreement between two rankings of the items. It is essentially Pearson's correlation computed over ranks. If items are tied, they are assigned the average of their ranks.

\[ \frac{\sum_{i}(r_1(i)-\mu_1)(r_2(i)-\mu_2)}{\sqrt{\sum_i(r_1(i)-\mu_1)^2}\sqrt{\sum_i(r_2(i)-\mu_2)^2}}\]

where \(r_1(i)\) and \(r_2(i)\) are the actual and predicted ranks of item \(i\)

Spearman’s correlation weighs each item the same irrespective of whether it is at rank 1 or rank 21, which is one of the shortcomings of this method, since mistakes at the top of the list matter more than mistakes at the bottom

Discounted Cumulative Gain

This measures the utility of the item at each position in the list, discounted by position, and (for nDCG) normalized by the total achievable utility

\[DCG(R) = \sum_iu(i)\cdot d(i)\] \[nDCG(R) = \frac{DCG(R)}{DCG(R_{perfect})}\]

The utility of the item can be the user rating of the item, or 1 if clicked and 0 otherwise.

The discount is a function that decreases as you go down the list. \[d(i) = \frac{1}{\max(1,\log_2 i)}\]

Another discount is the half-life utility (Breese, 1996) \[d(i) = \frac{1}{2^{\frac{r(i)-1}{\alpha - 1}}}\] where \(r(i)\) is the rank of item \(i\) and \(\alpha\) is the half-life (the rank at which the discount reaches one half)

The above metric uses an exponential decay i.e. users are exponentially less likely to click on each successive item
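A small Python sketch of DCG/nDCG using the \(1/\max(1,\log_2 i)\) discount above, with ratings as the utility (toy data):

```python
import math

def dcg(utilities):
    """DCG with the 1/max(1, log2(rank)) discount; utilities are in recommendation order."""
    return sum(u / max(1.0, math.log2(i)) for i, u in enumerate(utilities, start=1))

def ndcg(utilities):
    """Normalize by the DCG of the same utilities sorted into the best possible order."""
    ideal = dcg(sorted(utilities, reverse=True))
    return dcg(utilities) / ideal if ideal else 0.0

print(ndcg([3, 5, 0, 4, 1]))   # 1.0 would mean a perfectly ordered list
```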

Fraction of Concordant Pairs

Looks at the fraction of all pairs that it puts in the correct order

Summary

nDCG is increasingly common. MRR is also used.

Fallacy of Hidden Data Evaluation

Recommenders are about helping people find new stuff they’d like

Hidden-data evaluation penalizes algorithms that find new stuff

In the end, for real relevance, we need to look at user experience

More Metrics

Coverage

Coverage represents the percentage of products for which the recommender makes a prediction

  • Or a prediction that is personalized
  • Or a prediction above the confidence threshold

Almost always, it is a simple percentage

Directly relevant when dealing with predictions. It says what percentage of movies you can get a star rating score for. It is often used as a background metric in top-N recommenders (i.e. what percentage of items will appear in someone’s top-N)

Diversity

Diversity measures how different the items recommended are

One can begin with the pairwise similarity metric. This can be used to calculate an intralist similarity that is the average pairwise similarity metric for all pairs in the list. Lower scores imply higher diversity
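A small Python sketch of intra-list similarity using cosine over (made-up) item attribute vectors; lower values indicate a more diverse list:

```python
import math
from itertools import combinations

def cosine(a, b):
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def intra_list_similarity(rec_list, item_vectors):
    """Average pairwise similarity of the recommended items; lower means more diverse."""
    pairs = list(combinations(rec_list, 2))
    return sum(cosine(item_vectors[i], item_vectors[j]) for i, j in pairs) / len(pairs)

item_vectors = {
    "i1": {"action": 1.0},
    "i2": {"action": 0.9, "comedy": 0.2},
    "i3": {"drama": 1.0},
}
print(intra_list_similarity(["i1", "i2", "i3"], item_vectors))
```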

Process of diversification: find the item in the top-n list that most reduces diversity, and replace it with the (n+1)th item. If users prefer a wider set of interests, diversification may be helpful

Alternatives to diversification are:

  • Clustering approaches, which allow diversification through bundling
  • Scatter/gather: users get a set of items and select the ones they are interested in. The system looks at the selections and re-bundles the items.

Business goal: don’t turn away customers by pigeon-holing them into a narrow part of your catalog

Serendipity

Surprise, delight, not the expected result

\[ seren = \frac{1}{N}\sum_{i=1}^N\max(Pr(s_i)-Prim(s_i),0)\cdot isrel(s_i) \]

where \(Pr(s_i)\) is the predicted score, \(Prim(s_i)\) is a predicted score based on a primitive estimator, and \(isrel(s_i)\) is the relevancy of the prediction
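A direct reading of the formula above as Python; the primitive estimator here stands in for something like a popularity-based baseline, and the numbers are made up:

```python
def serendipity(scores, primitive_scores, relevant):
    """Average over recommended items of max(personalized score - primitive score, 0),
    counted only when the item is actually relevant to the user."""
    n = len(scores)
    total = 0.0
    for item, p in scores.items():
        bonus = max(p - primitive_scores.get(item, 0.0), 0.0)
        total += bonus * (1.0 if item in relevant else 0.0)
    return total / n

scores = {"i1": 0.9, "i2": 0.7}        # personalized predicted scores
primitive = {"i1": 0.85, "i2": 0.2}    # e.g. a popularity-based baseline
print(serendipity(scores, primitive, relevant={"i2"}))
```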

In practice, if you down-weight items that are highly popular, you will increase serendipity.

This tends to require experimentation and tuning. One may use a diversification factor approach, or a constant down-weighting by popularity

The business goal is to get people to consume the less popular item

Business objectives and metrics

  • Immediate lift
  • Net lift (subtract cost of returns)
  • Time to next transaction
  • Long-term customer value
  • Referrals

Summary: Metrics can capture some fuzzier objectives such as producing a diverse, delightfully surprising set of recommendations; or whether recommendations can go deep into the “long tail” of the product catalog and not just a few oft-recommended items. Experimental data shows that these may be worth sacrificing some accuracy for.

Experiment Protocols

The goal of offline evaluation is to estimate the recommender’s quality.

Evaluation approach: k-fold cross-validation. Split the data into k parts, hold out one part for evaluation, train on the rest, and repeat for each fold, averaging performance over the held-out folds.

Most published research uses k=5 or 10. Higher k gives more training but requires more evaluations.

Often, instead of splitting the ratings, we split based on users. Each test user's ratings are split into test and query sets: for each fold, a fixed fraction of their ratings is held out (test), and the rest (query) goes into training. The test ratings can be picked randomly (the default in LensKit), or you could use the user's 10 most recent ratings. We then compute RMSE/MAE/nDCG per user, average within the fold, and then average across folds for each metric.
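A minimal sketch of the per-user split just described, using random selection of the held-out ratings; the test fraction and seed are arbitrary choices:

```python
import random

def per_user_split(ratings, test_fraction=0.2, seed=42):
    """Split each user's ratings into a held-out test set and a query/training set."""
    rng = random.Random(seed)
    train, test = {}, {}
    for user, items in ratings.items():
        pairs = list(items.items())
        rng.shuffle(pairs)
        n_test = max(1, int(len(pairs) * test_fraction))
        test[user] = dict(pairs[:n_test])
        train[user] = dict(pairs[n_test:])
    return train, test

ratings = {"alice": {"i1": 5, "i2": 3, "i3": 4, "i4": 2, "i5": 5}}
train, test = per_user_split(ratings)
print(train, test)
```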

Unary data evaluation

Unary data is positive-only (purchase, like, click). It is often studied under ‘implicit feedback’, but it is not exactly the same thing. We do not know which items the user evaluated but did not pick.

Metrics

  • Precision/Recall/MAP: but is bad really bad?
  • MRR (Mean Reciprocal Rank): still gets pushed down
  • % at or before rank: histogram of raw data for MRR
  • nDCG: first item may still be misjudged

Mitigation Strategies

  • Synthesize unary data to get negatives (e.g. use ratings data, and recommend from rated data)
  • Limit the domain of recommendation
  • Another promising strategy is spam/non-spam detection in cases where there is missing user data

Summary

Evaluating recommenders is hard. With unary data it is more so.

User level evaluation

Metrics only tell part of the story. What we really want to know is what users prefer, and how to balance the different metrics against what users actually prefer.

Spectrum of techniques

  • Usage logs
  • Polls, surveys, focus groups: measures overall desire, context, usage pattern
  • Lab (and online Lab) experiments
  • Field experiments and Trials

Lab experiments

  • Bring people into a lab and see how they pick recommendations. Likert-scale questions common. This could also be conducted online.

Item-Item Collaborative Filtering

Disadvantages of user-user filtering

Item-item similarity fairly stable

Item similarity is a route to computing a prediction of a user's preference for an item

Computation: Two-step process

Item Item top-N

Main limitation: Item-item recommendations lower serendipity

Components

Item similarities

Typically 20 neighbors are picked

Scoring: For each item to score
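A rough Python sketch of the two-step process (precompute item-item similarities, then score each candidate item from the user's ratings of its most similar items); the adjusted-cosine similarity, the neighborhood size, and the toy data are my own choices, not taken from the lecture:

```python
import math

ratings = {  # user -> {item: rating}
    "alice": {"i1": 5, "i2": 3, "i3": 4},
    "bob":   {"i1": 4, "i2": 2, "i3": 5, "i4": 4},
    "carol": {"i1": 2, "i2": 5, "i4": 1},
}

# Step 1: precompute item-item similarities (adjusted cosine: subtract each user's mean first).
user_mean = {u: sum(r.values()) / len(r) for u, r in ratings.items()}
item_vecs = {}
for u, r in ratings.items():
    for i, v in r.items():
        item_vecs.setdefault(i, {})[u] = v - user_mean[u]

def cosine(a, b):
    common = set(a) & set(b)
    num = sum(a[u] * b[u] for u in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return num / (na * nb) if na and nb else 0.0

sim = {(i, j): cosine(item_vecs[i], item_vecs[j]) for i in item_vecs for j in item_vecs if i != j}

# Step 2: score a candidate item for a user from their ratings of the most similar items.
def score(user, item, k=20):
    neighbors = sorted(
        ((sim[(item, j)], r) for j, r in ratings[user].items() if (item, j) in sim),
        reverse=True)[:k]
    num = sum(s * r for s, r in neighbors if s > 0)
    den = sum(s for s, _ in neighbors if s > 0)
    return num / den if den else None

print(score("alice", "i4"))
```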

Item-Item with unary data

Dimensionality Reduction

Formulation

We want to decompose an m×n matrix \(R\) (where m is the number of users and n is the number of items) as \[ R=U{\Sigma}V^T\] where \(U\) is the user-feature matrix, \(V\) is the item-feature matrix, and \(\Sigma\) is a diagonal matrix of weights (the singular values)

We only keep the \(k\) largest singular values and drop the rest. This gives us an m×k user-feature matrix \(U\), an n×k item-feature matrix \(V\), and a weight matrix with k diagonal entries. This is the best rank-k approximation of \(R\) by global RMSE

The prediction of item i for user a is given by: \[p_{a,i} = \sum_{f=1}^{k}u_{a,f}{\sigma}_fv_{i,f}\]

For new users, or existing users that rate new items, we can get the user-feature vector as below

\[u_a = {\Sigma}^{-1}V^T{r_a}^T\]

This is referred to as folding in. If a new item is added, we can use an analogous approach. This is done because it is computationally expensive to re-compute the SVD

For missing data, impute means or pre-normalize data
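A numpy sketch of the decomposition above on a small mean-imputed rating matrix, keeping k singular values and predicting with \(p_{a,i} = \sum_f u_{a,f}\sigma_f v_{i,f}\); the data and the imputed value are made up:

```python
import numpy as np

# Toy user x item rating matrix; missing entries are imputed with a mean beforehand.
R = np.array([
    [5.0, 3.0, 4.0, 4.0],
    [4.0, 2.0, 5.0, 4.0],
    [2.0, 5.0, 3.5, 1.0],   # the 3.5 stands in for a missing rating imputed with a mean
])

U, s, Vt = np.linalg.svd(R, full_matrices=False)

k = 2                                        # keep only the k largest singular values
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T    # Uk is m x k, Vk is n x k

def predict(a, i):
    """p_{a,i} = sum_f u_{a,f} * sigma_f * v_{i,f}"""
    return float(np.sum(Uk[a] * sk * Vk[i]))

print(predict(0, 3))                         # approximately reconstructs an existing rating

# Folding in a new user: u_a = Sigma^{-1} V^T r_a^T, then predict as before.
r_new = np.array([4.0, 2.5, 4.5, 3.5])
u_new = np.diag(1.0 / sk) @ Vk.T @ r_new
print(float(np.sum(u_new * sk * Vk[2])))
```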

Training SVD

Funk-SVD is a gradient-descent approach used by Simon Funk for the Netflix Prize.

  • Train each feature one at a time
  • Ignore missing values
  • Learn offset from personalized mean
  • Fold singular values into left-right matrices

We use gradient descent as follows:

** Add notes for the rest of the SVD lesson **
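A minimal sketch (my own reconstruction, not from the lecture) of a typical Funk-SVD training loop: train one feature at a time over the observed ratings only, nudging user and item feature values along the error gradient with regularization. It uses a global-mean baseline rather than the personalized offsets mentioned above, and the learning rate and regularization values are arbitrary:

```python
import random

ratings = [("alice", "i1", 5.0), ("alice", "i2", 3.0), ("bob", "i1", 4.0), ("bob", "i3", 5.0)]
users = {u for u, _, _ in ratings}
items = {i for _, i, _ in ratings}
mu = sum(r for _, _, r in ratings) / len(ratings)   # global mean baseline (simplification)

k, lrate, reg = 2, 0.01, 0.02
rng = random.Random(0)
P = {u: [rng.uniform(-0.1, 0.1) for _ in range(k)] for u in users}   # user features
Q = {i: [rng.uniform(-0.1, 0.1) for _ in range(k)] for i in items}   # item features

def predict(u, i):
    return mu + sum(pf * qf for pf, qf in zip(P[u], Q[i]))

# Train one feature at a time; missing ratings are ignored (only observed ones are visited).
for f in range(k):
    for _ in range(100):
        for u, i, r in ratings:
            err = r - predict(u, i)
            pu, qi = P[u][f], Q[i][f]
            P[u][f] += lrate * (err * qi - reg * pu)
            Q[i][f] += lrate * (err * pu - reg * qi)

print(predict("alice", "i1"), predict("bob", "i2"))
```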

Extending Matrix Factorization

The predicted rating of user a for item i according to the linear model is given by:

\[ r_{a,i} = \mu + b_a + b_i + {\alpha}t_{week}\] where \(b_a\) is the user bias, and \(b_i\) is the item bias, and the last term represents any day-of-the-week bias

According to the Funk-SVD calculation, we have

\[ r_{a,i} = \mu + b_a + b_i + u_a{\cdot}v_i\]

Thus, the SVD is a version of the linear regression model with latent features

The SVD++ model extends this with additional features that encode implicit preference information (the set of items the user has rated)

\[ r_{a,i} = b_{a,i} + \Big(u_a + \frac{1}{|N_a|}\sum_{j{\in}N_a}y_j\Big)\cdot v_i\]

where \(u_a\) is the user-feature vector, \(N_a\) is the set of items rated by user \(a\), and \(y_j\) is an implicit feature vector for item \(j\)

SVD feature generalizes this further

CLiMF optimizes Mean Reciprocal Rank (MRR) of a user’s recommendation

Factorization Machines (combines matrix factorization with support vector machines)

References