This document contains course notes from the Coursera course Introduction to Recommender Systems by Joseph A. Konstan and Michael D. Ekstrand
A recommender system is analyzed based on the following 8 dimensions:
Types of Recommendations
Alternate version of the above: \[\frac{\frac{X\&Y}{X}}{\frac{!X\&Y}{!X}}\] This tracks the increase in Y associated with X
Association rules: \[\frac{P(X\&Y)}{P(X)*P(Y)}\]
Ratings have self-selection bias - in Zagat restaurant ratings, restaurants that people choose not to visit never get rated
Preferences can be classified into two groups
Explicit Ratings
Implicit Data
Content is a set of descriptors that describe the things that we are recommending
Key Ideas
How to build a profile?
How to build a preference?
How to use preferences?
Case-based approaches are most helpful for ephemerally-personalized experiences (e.g. shopping - suggest similar other items). They are often easier to explain to people
Motivation for TF-IDF comes from search
TF-IDF weighting
What does TF-IDF do?
How does TF-IDF apply to CBF?
Deep dive into Content-based filtering
Summary
Our past agreements predict our future agreements
Predicted rating of user a for item i can be given by the average of the ratings of item i by all other users (or, alternatively, by users within a neighborhood) \[ P_{a,i}= \frac{\sum_{u=1}^n{r_{u,i}}}{n} \]
From the above, we can further personalize the recommendations as follows:
\[ P_{a,i}= \frac{\sum_{u=1}^n{r_{u,i}.w_{a,u}}}{\sum_{u=1}^nw_{a,u}} \]
where \(w_{a,u}\) is a weighting function that measures the similarity of users \(a\) and \(u\)
An alternate way is to normalize the ratings for each user
\[ P_{a,i}= \bar{r}_a + \frac{\sum_{u=1}^n({r_{u,i}}-\bar{r}_u)}{n} \]
where \(\bar{r}_a\) is the average rating for user \(a\), and \(\bar{r}_u\) is the average rating for user \(u\).
This adjusts for how each user's rating of the item compares to that user's average rating.
The above two approaches can be combined to give a general prediction formula
\[ P_{a,i}= \bar{r}_a + \frac{\sum_{u=1}^n({r_{u,i}}-\bar{r}_u).w_{a,u}}{\sum_{u=1}^nw_{a,u}} \]
In the above, we state how much a user likes a given item for which the prediction is being made relative to all items the user has rated, and how close the user is to the user for whom we are making the prediction.
One caveat with the above is that the predicted rating can fall outside the allowed rating range (e.g. 1 to 5 stars). This can be addressed by clamping predictions that fall outside the range to the nearest bound.
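A minimal sketch of the mean-centered, similarity-weighted prediction above, assuming ratings are stored as a dict mapping each user to an {item: rating} dict and that a similarity function (such as the Pearson weight defined below) is passed in; the names and the 1-5 clamp range are illustrative:

```python
def predict(ratings, sim, a, i, lo=1.0, hi=5.0):
    """Mean-centered, similarity-weighted prediction of item i for user a.

    ratings: dict mapping user -> {item: rating}
    sim:     function sim(ratings, a, u) returning the weight w_{a,u}
    """
    mean_a = sum(ratings[a].values()) / len(ratings[a])
    num, den = 0.0, 0.0
    for u, r_u in ratings.items():
        if u == a or i not in r_u:
            continue                      # only neighbors who rated item i contribute
        mean_u = sum(r_u.values()) / len(r_u)
        w = sim(ratings, a, u)
        num += (r_u[i] - mean_u) * w
        den += abs(w)                     # |w| in the denominator handles negative weights
    if den == 0:
        return mean_a                     # no usable neighbors: fall back to the user's mean
    pred = mean_a + num / den
    return max(lo, min(hi, pred))         # clamp to the allowed rating range
```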
An example of the weighting function is the Pearson’s correlation:
\[w_{a,u} = \frac{\sum_{i=1}^m(r_{a,i}-\bar{r}_a)(r_{u,i}-\bar{r}_u)}{\sigma_a\sigma_u}\]
The sum above is the sum over the items in common. This raises 2 issues:
You can select:
How many neighbors do you use?
With fewer users, you get less noise, but also less coverage
How do you score items from neighborhood?
Weighted average is widely used and works pretty well; also well grounded in utility theory
What’s wrong with data?
In the formula above, we can alternatively use the z-score of a user's rating.
\[z_{u,i} = \frac{r_{u,i}-\bar{r}_u}{\sigma_u}\]
This normalizes the rating based on not just how much the rating deviates from the average, but also how much of a range the user uses.
The prediction formula then becomes
\[ P_{a,i}= \frac{\sum_{u=1}^n{z_{u,i}.w_{a,u}}}{\sum_{u=1}^nw_{a,u}}.\sigma_a + \mu_a \]
If there are neighbors with negative similarities, you can either use the absolute similarity in the denominator, or exclude neighbors with negative similarities.
An alternate way of thinking about ratings is as follows:
\[\mu + \hat{\mu}_i + \hat{\mu}_u\]
where \(\mu\) is the global average rating, \(\hat{\mu}_i\) is the item offset, i.e. the average deviation of the item's ratings from the global average \[ \hat{\mu}_i = \frac{\sum_u(r_{u,i} - \mu)}{\text{# of ratings of item } i}\] and \(\hat{\mu}_u\) is the user offset, defined as the user's average offset from the item baselines \[\hat{\mu}_u = \frac{\sum_i(r_{u,i} - \mu - \hat{\mu}_i)}{\text{# of ratings by user } u}\]
This is not particularly useful for user-user collaborative filtering, but will be useful in other algorithms later on.
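As a minimal sketch of these baseline offsets, assuming ratings arrive as (user, item, rating) triples (the names are illustrative):

```python
from collections import defaultdict

def baseline_offsets(ratings):
    """Global mean, item offsets, and user offsets from (user, item, rating) triples."""
    mu = sum(r for _, _, r in ratings) / len(ratings)

    by_item = defaultdict(list)
    for _, i, r in ratings:
        by_item[i].append(r - mu)                       # deviation from the global mean
    item_off = {i: sum(ds) / len(ds) for i, ds in by_item.items()}

    by_user = defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append(r - mu - item_off[i])         # deviation from the item baseline
    user_off = {u: sum(ds) / len(ds) for u, ds in by_user.items()}
    return mu, item_off, user_off

# A baseline score for user u and item i is then mu + item_off[i] + user_off[u].
```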
The weighting function, as discussed earlier, is given by Pearson’s correlation. \[w_{a,u} = \frac{\sum_{i=1}^m(r_{a,i}-\bar{r}_a)(r_{u,i}-\bar{r}_u)}{\sqrt{\sum_{i=1}^m(r_{a,i}-\bar{r}_a)^2}\sqrt{\sum_{i=1}^m(r_{u,i}-\bar{r}_u)^2}}\]
If two users have only one item in common, there is not much to compare on. One way to counteract this is significance weighting: multiply the similarity by \(\min(50, |R_a \cap R_u|)/50\). So if two users have 25 items in common, the similarity is cut in half compared to users that have 50 or more items in common.
Another way is to introduce a constant that weighs down the similarity weight until you exceed a certain number of common items, as with \(K\) below \[w_{a,u} = \frac{\sum_{i=1}^m(r_{a,i}-\bar{r}_a)(r_{u,i}-\bar{r}_u)}{\sqrt{\sum_{i=1}^m(r_{a,i}-\bar{r}_a)^2}\sqrt{\sum_{i=1}^m(r_{u,i}-\bar{r}_u)^2 + K}}\]
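A minimal sketch of the Pearson weight with significance weighting, using the same dict-of-dicts rating layout as the prediction sketch above (the shrink threshold of 50 follows the note above; everything else is illustrative):

```python
from math import sqrt

def pearson(ratings, a, u, shrink=50):
    """Pearson correlation over co-rated items, damped when users share few items."""
    common = set(ratings[a]) & set(ratings[u])
    if not common:
        return 0.0
    mean_a = sum(ratings[a].values()) / len(ratings[a])
    mean_u = sum(ratings[u].values()) / len(ratings[u])
    num = sum((ratings[a][i] - mean_a) * (ratings[u][i] - mean_u) for i in common)
    den = (sqrt(sum((ratings[a][i] - mean_a) ** 2 for i in common))
           * sqrt(sum((ratings[u][i] - mean_u) ** 2 for i in common)))
    if den == 0:
        return 0.0
    w = num / den
    return w * min(len(common), shrink) / shrink   # significance weighting
```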
Another similarity measure is Spearman’s rank correlation. In practice, it doesn’t work as well as Pearson’s correlation.
Another approach is the vector cosine, where each user’s item ratings are treated as a vector, and we compute the cosine of the angle between two users’ rating vectors.
\[sim(a,u) = \frac{\vec{a}\cdot\vec{u}}{||\vec{a}||.||\vec{u}||}\] \[sim(a,u)=\frac{\sum_i(r_{a,i}.r_{u,i})}{\sqrt{\sum_ir_{a,i}^2}\sqrt{\sum_ir_{u,i}^2}}\]
If we normalize the user ratings by subtracting each user’s average rating, then the above is equivalent to Pearson’s correlation. There is one key difference, though: the formula above treats unrated items as 0, so the similarity is scaled down when users don’t have many items in common.
Some of the early literature showed Pearson’s correlation doing better than vector cosine with raw ratings. However some recent literature shows that vector cosine with normalized rating does as well or better than Pearson’s correlation with significance weighting.
Additional things you can try out:
In the early days:
In business:
4 Themes:
Prediction vs top-N: Prediction is more about accuracy, possibly decision support, and is focused locally. Top-N is more about ranking and decision support, and is focused comparatively.
Mean Absolute Error (MAE)
An error is the divergence of the prediction from the actual opinion (rating)
Absolute error: \(|P-R|\)
MAE \[\frac{\sum_{ratings}|P-R|}{\text{# ratings}}\]
Mean Squared Error (MSE)
MSE penalizes large errors more than small ones
\[\frac{\sum_{ratings}(P-R)^2}{\text{# ratings}}\]
Squared error is not on an intuitive scale
Root Mean Square Error (RMSE)
\[\sqrt{\frac{\sum_{ratings}(P-R)^2}{\text{# ratings}}}\]
Summation: We can either average across all ratings, or average over per-user averages. The latter weighs each user equally, irrespective of how many ratings they have. Ideally you compute both. If MAE across ratings is much lower than MAE by user, then you are doing better for people with lots of ratings than for those with fewer ratings. If they are similar, you are doing a balanced job.
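A minimal sketch of MAE and RMSE computed both over all ratings and averaged by user, assuming the test results are (user, predicted, actual) triples:

```python
from math import sqrt
from collections import defaultdict

def error_metrics(results):
    """results: list of (user, predicted, actual) triples."""
    abs_errs = [abs(p - r) for _, p, r in results]
    sq_errs = [(p - r) ** 2 for _, p, r in results]
    mae = sum(abs_errs) / len(abs_errs)                 # averaged over all ratings
    rmse = sqrt(sum(sq_errs) / len(sq_errs))

    per_user = defaultdict(list)
    for u, p, r in results:
        per_user[u].append(abs(p - r))
    mae_by_user = sum(sum(e) / len(e) for e in per_user.values()) / len(per_user)
    return mae, rmse, mae_by_user                       # compare mae against mae_by_user
```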
When comparing different algorithms:
Error metrics are highly correlated with each other. One of the big drawbacks of error metrics is that the error can be dominated by irrelevant parts of the item space.
Decision Support measures how well a recommender helps users make good decisions. Good decisions are about choosing ‘good’ items and avoiding ‘bad’ ones.
For predictions, confusing 4 stars with 2.5 stars is worse than confusing 2.5 stars with 1 star, even though both errors are the same size: the first mistake crosses the line between a good and a bad item.
For recommendations, top of list is what matters most.
Errors, in this sense, are an ad-hoc measure of wrong predictions; for ratings, say, this involves classifying predictions into good and bad.
Reversals are large mistakes, e.g. off by 3 points on a 5-point scale. The intuition is that such mistakes are especially damaging.
Information Retrieval Metrics
Precision is the percentage of selected items that are relevant (i.e. how effectively the filter eliminates irrelevant items)
\[P = \frac{N_{rs}}{N_s} \]
Recall is the percentage of relevant items that are selected (i.e. how well the filter avoids eliminating relevant items)
\[R = \frac{N_{rs}}{N_r}\]
Precision is about making sure that what gets recommended is mostly useful. With a high-precision filter, the user does not waste time looking at non-relevant items. The assumption behind precision is that there are more relevant items than you have time for, so you care mostly about returning fewer irrelevant items.
Recall is about not missing something that is relevant. The assumption with recall is that you have the time to filter through the items to find the one particular item you need.
F1 metric tries to find a balance between the above two.
\[ F1 = \frac{2.P.R}{P+R} \]
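A minimal sketch of these three measures for a single recommendation list (item ids and the relevance set are whatever the evaluation defines as relevant):

```python
def precision_recall_f1(recommended, relevant):
    """recommended: list of recommended item ids; relevant: set of relevant item ids."""
    hits = len(set(recommended) & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```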
Problems with precision and recall:
Problem #1:
Problem #2:
Mean Average Precision (MAP)
\[ MAP = \frac{\sum_{q=1}^{Q}AveP(q)}{Q} \]
where
\[AveP = \frac{\sum_{k=1}^nP(k)*rel(k)}{\text{#relevant docs}}\]
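A minimal sketch of AveP and MAP over ranked recommendation lists, assuming one (ranked list, relevant set) pair per test query or user:

```python
def average_precision(ranked, relevant):
    """Average precision of a single ranked list against a set of relevant items."""
    hits, total = 0, 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / k            # P(k) * rel(k)
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(queries):
    """queries: list of (ranked_list, relevant_set) pairs."""
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)
```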
Receiver Operating Characteristics (ROC)
Summary
Overview: Prediction accuracy answers the question of how well the recommender does at estimating preference
Decision support metrics answers how well the recommender does at finding good things
Rank accuracy metrics look at how well the recommender estimates relative preference
Metric families
Mean Reciprocal Rank (MRR)
Reciprocal rank (RR): 1/i where i is the rank of the first ‘good’ item
RR measures how far you have to go to find something good. MRR is an average over all test queries
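A minimal sketch of RR and MRR, using the same (ranked list, relevant set) layout as the MAP sketch above:

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0 if none is found."""
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / k
    return 0.0

def mean_reciprocal_rank(queries):
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)
```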
Spearman Rank Correlation
This measures the agreement between two rankings of the items. It is essentially Pearson’s correlation computed over ranks. If items are tied at the same rank, they are assigned the average of their ranks
\[ \frac{\sum_{i}(r_1(i)-\mu_1)(r_2(i)-\mu_2)}{\sqrt{\sum_i(r_1(i)-\mu_1)^2}\sqrt{\sum_i(r_2(i)-\mu_2)^2}}\]
where \(r_1(i)\) and \(r_2(i)\) are the actual and predicted ranks of item \(i\)
Spearman’s correlation weighs each item the same irrespective of whether it is at rank 1 or rank 21, which is one of the shortcomings of this method, since errors at the top of the list are more visible than those at the bottom
Discounted Cumulative Gain
This measures the utility of the item at each position in the list, discounts it by position, and normalizes by the total achievable utility
\[DCG(R) = \sum_iu(i).d(i)\] \[nDCG(R) = \frac{DCG(R)}{DCG(R_{perfect})}\]
The utility of the item can be the user rating of the item, or 1 if clicked and 0 otherwise.
The discount is a function that decreases as you go down the list. \[d(i) = \frac{1}{\max(1,\log_2 i)}\]
Another discount is the half-life utility (Breese, 1996) \[d(i) = \frac{1}{2^{\frac{r(i)-1}{\alpha - 1}}}\]
The above metric uses an exponential decay i.e. users are exponentially less likely to click on each successive item
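A minimal sketch of DCG and nDCG using the \(1/\max(1, \log_2 i)\) discount above; the utilities can be ratings or 0/1 click indicators, listed in the order the recommender ranked the items:

```python
from math import log2

def dcg(utilities):
    """utilities: item utilities in ranked order (e.g. ratings, or 1 if clicked else 0)."""
    return sum(u / max(1.0, log2(i)) for i, u in enumerate(utilities, start=1))

def ndcg(utilities):
    """Normalize by the DCG of the best achievable ordering of the same utilities."""
    ideal = dcg(sorted(utilities, reverse=True))
    return dcg(utilities) / ideal if ideal > 0 else 0.0
```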
Fraction of Concordant Pairs
Looks at the fraction of all pairs that it puts in the correct order
Summary
nDCG is increasingly common. MRR is also used.
Coverage represents the percentage of products for which the recommender makes a prediction
Almost always, it is a simple percentage
Directly relevant when dealing with predictions. It says what percentage of movies you can get a star rating score for. It is often used as a background metric in top-N recommenders (i.e. what percentage of items will appear in someone’s top-N)
Diversity measures how different the items recommended are
One can begin with the pairwise similarity metric. This can be used to calculate an intralist similarity that is the average pairwise similarity metric for all pairs in the list. Lower scores imply higher diversity
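A minimal sketch of intra-list similarity, assuming some pairwise item-similarity function item_sim(i, j) is available (e.g. a content or rating-vector similarity; the name is illustrative):

```python
from itertools import combinations

def intra_list_similarity(items, item_sim):
    """Average pairwise similarity over all pairs in a recommendation list.

    Lower values indicate a more diverse list."""
    pairs = list(combinations(items, 2))
    if not pairs:
        return 0.0
    return sum(item_sim(i, j) for i, j in pairs) / len(pairs)
```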
Process of diversification: find the item in the top n that most reduces diversity (i.e. contributes most to intra-list similarity) and replace it with the (n+1)th item. If users prefer a wider set of interests, diversification may be helpful
Alternatives to diversification include:
Business goal: don’t turn away customers by pigeon-holing them into a narrow part of your catalog
Surprise, delight, not the expected result
\[ seren = \frac{1}{N}\sum_{i=1}^Nmax(Pr(s_i)-Prim(s_i),0)*isrel(s_i) \]
where \(Pr(s_i)\) is the predicted score, \(Prim(s_i)\) is a predicted score based on a primitive estimator, and \(isrel(s_i)\) is the relevancy of the prediction
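A minimal sketch of this serendipity measure, assuming the scores, primitive scores, and relevance indicators are lists aligned item by item:

```python
def serendipity(scores, primitive_scores, relevance):
    """Average positive lift over a primitive (e.g. popularity-based) predictor,
    counted only where the item is relevant."""
    n = len(scores)
    return sum(max(p - pp, 0.0) * rel
               for p, pp, rel in zip(scores, primitive_scores, relevance)) / n
```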
In practice, if you downweight items that are highly popular, you will increase serendipity.
This tends to require experimentation and tuning. One may use a diversification factor approach, or a constant down-weighting by popularity
The business goal is to get people to consume the less popular item
Summary: Metrics can capture some fuzzier objectives, such as producing a diverse, delightfully surprising set of recommendations, or going deep into the “long tail” of the product catalog rather than recommending only a few oft-recommended items. Experimental data shows that these may be worth sacrificing some accuracy for.
The goal of offline evaluation is to estimate the recommender’s quality.
Evaluation approach: k-fold cross-validation. Split the data into k parts, hold out one part for evaluation, then repeat and average performance over the held-out parts.
Most published research uses k=5 or 10. Higher k gives more training but requires more evaluations.
Often, instead of splitting the ratings, we split by user. Each test user’s ratings are split into query and test sets: for each fold, a fixed fraction of their ratings is held out as the test set, and the rest (the query set) goes into training. The test ratings can be picked at random (the default in LensKit), or you could use the user’s 10 most recent ratings. We then compute RMSE/MAE/nDCG per user, average over users for the fold, and then average over folds per metric.
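A minimal sketch of the per-user query/test split described above, with randomly chosen test ratings (using the most recent ratings instead would require timestamps; the fraction and seed are illustrative):

```python
import random

def split_user_ratings(user_ratings, test_fraction=0.2, seed=0):
    """Split each user's ratings into query (training) and test sets.

    user_ratings: dict mapping user -> {item: rating}."""
    rng = random.Random(seed)
    train, test = {}, {}
    for u, items in user_ratings.items():
        ids = list(items)
        rng.shuffle(ids)
        n_test = max(1, int(len(ids) * test_fraction))
        test[u] = {i: items[i] for i in ids[:n_test]}
        train[u] = {i: items[i] for i in ids[n_test:]}
    return train, test
```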
Unary data is positive-only (purchase, like, click) - often studied under ‘implicit feedback’ but not exactly the same thing. We do not know the actions that the user evaluated but did not pick.
Metrics
Mitigation Strategies
Synthesize unary data to get negatives (e.g. use ratings data, and recommend from rated items)
Limit the domain of recommendation to, say, …
Another promising strategy is spam/non-spam detection in cases where there is missing user data
Summary
Evaluating recommenders is hard. With unary data it is more so.
Metrics only tell part of the story. What we really want to know is what users prefer, and how users trade off the different qualities that the metrics measure.
Spectrum of techniques
Lab experiments
Disadvantages of user-user filtering
Item-item similarity fairly stable
Item similarity is a route to computing a prediction of a user’s preference for an item
Computation: Two-step process
Item Item top-N
Main limitation: Item-item recommendations lower serendipity
Components
Item similarities
Typically 20 neighbors are picked
Scoring: For each item to score
Item-Item with unary data
purchase count matrix
Use conditional probability
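A minimal sketch of conditional-probability item similarity from unary data, assuming each user's purchases arrive as a set (a "basket"); the names are illustrative:

```python
from collections import defaultdict

def conditional_prob_similarity(baskets):
    """Item-item scores sim[i][j] = P(j | i) = count(i and j) / count(i).

    baskets: iterable of sets, each the items purchased by one user."""
    count = defaultdict(int)
    co_count = defaultdict(lambda: defaultdict(int))
    for basket in baskets:
        for i in basket:
            count[i] += 1
            for j in basket:
                if j != i:
                    co_count[i][j] += 1
    return {i: {j: c / count[i] for j, c in co.items()} for i, co in co_count.items()}
```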
We want to decompose an m×n matrix \(R\) (where m is the number of users and n is the number of items) into \[ R=U{\Sigma}V^T\] where \(U\) is the user-feature matrix, \(V\) is the item-feature matrix, and \(\Sigma\) is a diagonal matrix of weights (the singular values in linear algebra)
We only keep the \(k\) largest singular values and delete the rest. This gives us an m×k user-feature matrix \(U\), an n×k item-feature matrix \(V\), and k diagonal weights. This is the best rank-k approximation of \(R\) by global RMSE
The prediction of item i for user a is given by: \[p_{a,i} = \sum_{f=1}^{k}u_{a,f}{\sigma}_fv_{i,f}\]
For new users, or existing users that rate new items, we can get the user-feature vector as below
\[u_a = {\Sigma}^{-1}V^T{r_a}^T\]
This is referred to as folding in. If a new item is added, we can use a corollary of the above approach. This is done because it is computationally expensive to re-compute the SVD
For missing data, impute means or pre-normalize data
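A minimal numpy sketch of the truncated SVD, prediction, and folding-in steps above, assuming missing ratings have already been imputed (e.g. with item means) so R is a dense m×n matrix:

```python
import numpy as np

def truncated_svd(R, k):
    """Rank-k SVD of a mean-imputed m x n rating matrix R."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def predict_all(U, s, Vt):
    """Reconstruct the full matrix of predicted ratings from the truncated factors."""
    return U @ np.diag(s) @ Vt

def fold_in_user(r_new, s, Vt):
    """Fold a new user's (imputed) rating vector into the factorization:
    u_new = r_new V Sigma^{-1}, avoiding a full recomputation of the SVD."""
    u_new = r_new @ Vt.T @ np.diag(1.0 / s)
    return u_new @ np.diag(s) @ Vt        # predicted ratings for the new user
```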
Funk-SVD is based on a gradient-descent approach used by Simon Funk for the Netflix Prize.
We use gradient descent as follows:
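A minimal sketch of the stochastic gradient update commonly used for Funk-SVD, training one latent feature at a time over the observed ratings only; bias terms are omitted, and the learning rate, regularization, and initialization values are illustrative:

```python
import numpy as np

def funk_svd(ratings, n_users, n_items, k=20, lr=0.005, reg=0.02, epochs=20):
    """Learn user and item feature matrices by gradient descent on observed ratings.

    ratings: list of (user_index, item_index, rating) triples."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    P = np.full((n_users, k), 0.1)           # user features
    Q = np.full((n_items, k), 0.1)           # item features
    for f in range(k):                       # train one feature at a time
        for _ in range(epochs):
            for u, i, r in ratings:
                err = r - (mu + P[u] @ Q[i])
                p_uf, q_if = P[u, f], Q[i, f]
                P[u, f] += lr * (err * q_if - reg * p_uf)
                Q[i, f] += lr * (err * p_uf - reg * q_if)
    return mu, P, Q
```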
** Add notes for the rest of the SVD lesson **
The predicted rating of user a for item i according to the linear model is given by:
\[ r_{a,i} = \mu + b_a + b_i + {\alpha}t_{week}\] where \(b_a\) is the user bias, and \(b_i\) is the item bias, and the last term represents any day-of-the-week bias
According to the Funk-SVD calculation, we have
\[ r_{a,i} = \mu + b_a + b_i + u_a*v_i\]
Thus, the SVD is a version of the linear regression model with latent features
The SVD++ model extends this by encoding user preferences as additional features
\[ r_{a,i} = b_{a,i} + \Big(u_a + \frac{1}{\sqrt{|N(a)|}}\sum_{j{\in}N(a)}y_j\Big) \cdot v_i\]
where \(u_a\) is the user-feature vector, \(N(a)\) is the set of items rated by user \(a\), and \(y_j\) is an implicit-feedback feature vector for item \(j\)
SVD feature generalizes this further
CLiMF optimizes Mean Reciprocal Rank (MRR) of a user’s recommendation
Factorization Machines (combines matrix factorization with support vector machines)
J. Konstan & J. Riedl Deconstructing Recommender Systems, IEEE Spectrum, Sep. 2012
Evan Miller, How not to sort by average rating
Kamba, Bharat, and Albers (WWW ‘95) Krakatoa Chronicle
J. Herlocker, J. Konstan & J. Riedl, Explaining Collaborative Filtering, CSCW 2001.
Michael D. Ekstrand, John T. Riedl, and Joseph A. Konstan, Collaborative Filtering Recommender Systems,Foundations and Trends® in Human-Computer Interaction 4, 2 (February 2011), pp 81–173.
Resnick, P. and Sami, R. 2007. The influence limiter: provably manipulation-resistant recommender systems. In Proceedings of the 2007 ACM Conference on Recommender Systems (Minneapolis, MN, USA, October 19 - 20, 2007). RecSys ’07. ACM, New York, NY, 25-32.