This document contains course notes from the Coursera course Introduction to Recommender Systems by Joseph A. Konstan and Michael D. Ekstrand
A recommender system is analyzed based on the following 8 dimensions:
Types of Recommendations
Alternate version of the above: \[\frac{\frac{X\&Y}{X}}{\frac{!X\&Y}{!X}}\] This tracks the increase in Y associated with X
Association rules: \[\frac{P(X\&Y)}{P(X)*P(Y)}\]
Ratings have self-selection bias - in Zagat restaurant ratings, restaurants that people choose not to visit never get rated
Preferences can be classified into two groups
Explicit Ratings
Implicit Data
Content is a set of descriptors that describe the things that we are recommending
Key Ideas
How to build a profile?
How to build a preference?
How to use preferences?
Case-based approaches are most helpful for ephemerally-personalized experiences (e.g. shopping - suggest similar other items). They are often easier to explain to people
Motivation for TF-IDF comes from search
TF-IDF weighting
What does TF-IDF do?
How does TF-IDF apply to CBF?
Deep dive into Content-based filtering
Summary
Our past agreements predict our future agreements
Predicted rating of user a for item i can be given by the average of the ratings of item i by all other users (or, alternatively, by users within a neighborhood) \[ P_{a,i}= \frac{\sum_{u=1}^n{r_{u,i}}}{n} \]
From the above, we can further personalize the recommendations as follows:
\[ P_{a,i}= \frac{\sum_{u=1}^n{r_{u,i}.w_{a,u}}}{\sum_{u=1}^nw_{a,u}} \]
where \(w_{a,u}\) is a weighting function that measures the similarity of users \(a\) and \(u\)
An alternate way is to normalize the ratings for each user
\[ P_{a,i}= \bar{r}_a + \frac{\sum_{u=1}^n({r_{u,i}}-\bar{r}_u)}{n} \]
where \(\bar{r}_a\) is the average rating for user \(a\), and \(\bar{r}_u\) is the average rating for user \(u\).
This adjusts for how each user's rating of the item compares to that user's average rating.
The above two approaches can be combined to give a general prediction formula
\[ P_{a,i}= \bar{r}_a + \frac{\sum_{u=1}^n({r_{u,i}}-\bar{r}_u).w_{a,u}}{\sum_{u=1}^nw_{a,u}} \]
In the above, we state how much a user likes a given item for which the prediction is being made relative to all items the user has rated, and how close the user is to the user for whom we are making the prediction.
One caveat with the above is that the predicted rating can fall outside the allowed rating range (e.g. 1 to 5 stars). This can be addressed by clamping predictions that fall outside the range to the nearest bound.
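A minimal sketch of the mean-centered, similarity-weighted prediction above, assuming ratings are stored as a dict mapping each user to an {item: rating} dict and that a similarity function (such as the Pearson weight defined below) is passed in; the names and the 1-5 clamp range are illustrative:

```python
def predict(ratings, sim, a, i, lo=1.0, hi=5.0):
    """Mean-centered, similarity-weighted prediction of item i for user a.

    ratings: dict mapping user -> {item: rating}
    sim:     function sim(ratings, a, u) returning the weight w_{a,u}
    """
    mean_a = sum(ratings[a].values()) / len(ratings[a])
    num, den = 0.0, 0.0
    for u, r_u in ratings.items():
        if u == a or i not in r_u:
            continue                      # only neighbors who rated item i contribute
        mean_u = sum(r_u.values()) / len(r_u)
        w = sim(ratings, a, u)
        num += (r_u[i] - mean_u) * w
        den += abs(w)                     # |w| in the denominator handles negative weights
    if den == 0:
        return mean_a                     # no usable neighbors: fall back to the user's mean
    pred = mean_a + num / den
    return max(lo, min(hi, pred))         # clamp to the allowed rating range
```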
An example of the weighting function is the Pearson’s correlation:
\[w_{a,u} = \frac{\sum_{i=1}^m(r_{a,i}-\bar{r}_a)(r_{u,i}-\bar{r}_u)}{\sigma_a\sigma_u}\]
The sum above is the sum over the items in common. This raises 2 issues:
You can select:
How many neighbors do you use?
With fewer users, you get less noise, but also less coverage
How do you score items from neighborhood?
Weighted average is widely used and works pretty well; also well grounded in utility theory
What’s wrong with data?
In the formula above, we can alternatively use the z-score of a user's rating.
\[z_{u,i} = \frac{r_{u,i}-\bar{r}_u}{\sigma_u}\]
This normalizes the rating based on not just how much the rating deviates from the average, but also how much of a range the user uses.
The prediction formula then becomes
\[ P_{a,i}= \frac{\sum_{u=1}^n{z_{u,i}.w_{a,u}}}{\sum_{u=1}^nw_{a,u}}.\sigma_a + \mu_a \]
If there are neighbors with negative similarities, you can either use the absolute similarity in the denominator, or exclude neighbors with negative similarities.
An alternate way of thinking about ratings is as follows:
\[\mu + \hat{\mu}_i + \hat{\mu}_u\]
where \(\mu\) is the global average rating, \(\hat{\mu}_i\) is the item offset, i.e. the average deviation of the item's ratings from the global average \[ \hat{\mu}_i = \frac{\sum_u(r_{u,i} - \mu)}{\text{# of ratings of item } i}\] and \(\hat{\mu}_u\) is the user offset, defined as the user's average offset from the item baselines \[\hat{\mu}_u = \frac{\sum_i(r_{u,i} - \mu - \hat{\mu}_i)}{\text{# of ratings by user } u}\]
This is not particularly useful for user-user collaborative filtering, but will be useful in other algorithms later on.
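As a minimal sketch of these baseline offsets, assuming ratings arrive as (user, item, rating) triples (the names are illustrative):

```python
from collections import defaultdict

def baseline_offsets(ratings):
    """Global mean, item offsets, and user offsets from (user, item, rating) triples."""
    mu = sum(r for _, _, r in ratings) / len(ratings)

    by_item = defaultdict(list)
    for _, i, r in ratings:
        by_item[i].append(r - mu)                       # deviation from the global mean
    item_off = {i: sum(ds) / len(ds) for i, ds in by_item.items()}

    by_user = defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append(r - mu - item_off[i])         # deviation from the item baseline
    user_off = {u: sum(ds) / len(ds) for u, ds in by_user.items()}
    return mu, item_off, user_off

# A baseline score for user u and item i is then mu + item_off[i] + user_off[u].
```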
The weighting function, as discussed earlier, is given by Pearson’s correlation. \[w_{a,u} = \frac{\sum_{i=1}^m(r_{a,i}-\bar{r}_a)(r_{u,i}-\bar{r}_u)}{\sqrt{\sum_{i=1}^m(r_{a,i}-\bar{r}_a)^2}\sqrt{\sum_{i=1}^m(r_{u,i}-\bar{r}_u)^2}}\]
If two users have only one item in common, there is not much to compare on. One way to counteract this is significance weighting: multiply the similarity by \(\min(50, |R_a \cap R_u|)/50\). So if two users have 25 items in common, the similarity is cut in half compared to users that have 50 or more items in common.
Another way is to introduce a constant that weighs down the similarity weight until you exceed a certain number of common items, as with \(K\) below \[w_{a,u} = \frac{\sum_{i=1}^m(r_{a,i}-\bar{r}_a)(r_{u,i}-\bar{r}_u)}{\sqrt{\sum_{i=1}^m(r_{a,i}-\bar{r}_a)^2}\sqrt{\sum_{i=1}^m(r_{u,i}-\bar{r}_u)^2 + K}}\]
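A minimal sketch of the Pearson weight with significance weighting, using the same dict-of-dicts rating layout as the prediction sketch above (the shrink threshold of 50 follows the note above; everything else is illustrative):

```python
from math import sqrt

def pearson(ratings, a, u, shrink=50):
    """Pearson correlation over co-rated items, damped when users share few items."""
    common = set(ratings[a]) & set(ratings[u])
    if not common:
        return 0.0
    mean_a = sum(ratings[a].values()) / len(ratings[a])
    mean_u = sum(ratings[u].values()) / len(ratings[u])
    num = sum((ratings[a][i] - mean_a) * (ratings[u][i] - mean_u) for i in common)
    den = (sqrt(sum((ratings[a][i] - mean_a) ** 2 for i in common))
           * sqrt(sum((ratings[u][i] - mean_u) ** 2 for i in common)))
    if den == 0:
        return 0.0
    w = num / den
    return w * min(len(common), shrink) / shrink   # significance weighting
```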
Another similarity measure is Spearman’s rank correlation. In practice, it doesn’t work as well as Pearson’s correlation.
Another approach is the vector cosine, where each user’s item ratings are treated as a vector, and we compute the cosine of the angle between two users’ rating vectors.
\[sim(a,u) = \frac{\vec{a}\cdot\vec{u}}{||\vec{a}||.||\vec{u}||}\] \[sim(a,u)=\frac{\sum_i(r_{a,i}.r_{u,i})}{\sqrt{\sum_ir_{a,i}^2}\sqrt{\sum_ir_{u,i}^2}}\]
If we normalize the user ratings by subtracting each user’s average rating, then the above is equivalent to Pearson’s correlation. There is one key difference, though: the formula above treats unrated items as 0, so the similarity is scaled down when users don’t have many items in common.
Some of the early literature showed Pearson’s correlation doing better than vector cosine with raw ratings. However some recent literature shows that vector cosine with normalized rating does as well or better than Pearson’s correlation with significance weighting.
Additional things you can try out:
In the early days:
In business:
4 Themes:
Prediction vs top-N: Prediction is more about accuracy, possibly decision support, and is focused locally. Top-N is more about ranking and decision support, and is focused comparatively.
Mean Absolute Error (MAE)
An error is the divergence of the prediction from the actual opinion (rating)
Absolute error: \(|P-R|\)
MAE \[\frac{\sum_{ratings}|P-R|}{\text{# ratings}}\]
Mean Squared Error (MSE)
MSE penalizes large errors more than small ones
\[\frac{\sum_{ratings}(P-R)^2}{\text{# ratings}}\]
Squared error is not on an intuitive scale
Root Mean Square Error (RMSE)
\[\sqrt{\frac{\sum_{ratings}(P-R)^2}{\text{# ratings}}}\]
Summation: We can either average across all ratings, or average over per-user averages. The latter weighs each user equally, irrespective of how many ratings they have. Ideally you compute both. If MAE across ratings is much lower than MAE by user, then you are doing better for people with lots of ratings than for those with fewer ratings. If they are similar, you are doing a balanced job.
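A minimal sketch of MAE and RMSE computed both over all ratings and averaged by user, assuming the test results are (user, predicted, actual) triples:

```python
from math import sqrt
from collections import defaultdict

def error_metrics(results):
    """results: list of (user, predicted, actual) triples."""
    abs_errs = [abs(p - r) for _, p, r in results]
    sq_errs = [(p - r) ** 2 for _, p, r in results]
    mae = sum(abs_errs) / len(abs_errs)                 # averaged over all ratings
    rmse = sqrt(sum(sq_errs) / len(sq_errs))

    per_user = defaultdict(list)
    for u, p, r in results:
        per_user[u].append(abs(p - r))
    mae_by_user = sum(sum(e) / len(e) for e in per_user.values()) / len(per_user)
    return mae, rmse, mae_by_user                       # compare mae against mae_by_user
```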
When comparing different algorithms:
Error metrics are highly correlated with each other. One of the big drawbacks of error metrics is that the error can be dominated by irrelevant parts of the item space.
Decision Support measures how well a recommender helps users make good decisions. Good decisions are about choosing ‘good’ items and avoiding ‘bad’ ones.
For predictions, confusing 4 stars with 2.5 stars is worse than confusing 2.5 stars with 1 star, even though both errors are the same size: the first mistake crosses the line between a good and a bad item.
For recommendations, top of list is what matters most.
Errors, in this sense, are an ad-hoc measure of wrong predictions; for ratings, say, this involves classifying predictions into good and bad.
Reversals are large mistakes, e.g. off by 3 points on a 5-point scale. The intuition is that such mistakes are especially damaging.
Information Retrieval Metrics
Precision is the percentage of selected items that are relevant (i.e. how effectively the filter eliminates irrelevant items)
\[P = \frac{N_{rs}}{N_s} \]
Recall is the percentage of relevant items that are selected (i.e. how well the filter avoids eliminating relevant items)
\[R = \frac{N_{rs}}{N_r}\]
Precision is about making sure that what gets recommended is mostly useful. With a high-precision filter, the user does not waste time looking at non-relevant items. The assumption behind precision is that there are more relevant items than you have time for, so you care mostly about returning fewer irrelevant items.
Recall is about not missing something that is relevant. The assumption with recall is that you have the time to filter through the items to find the one particular item you need.
F1 metric tries to find a balance between the above two.
\[ F1 = \frac{2.P.R}{P+R} \]
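A minimal sketch of these three measures for a single recommendation list (item ids and the relevance set are whatever the evaluation defines as relevant):

```python
def precision_recall_f1(recommended, relevant):
    """recommended: list of recommended item ids; relevant: set of relevant item ids."""
    hits = len(set(recommended) & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```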
Problems with precision and recall:
Problem #1:
Problem #2:
Mean Average Precision (MAP)
\[ MAP = \frac{\sum_{q=1}^{Q}AveP(q)}{Q} \]
where
\[AveP = \frac{\sum_{k=1}^nP(k)*rel(k)}{\text{#relevant docs}}\]
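A minimal sketch of AveP and MAP over ranked recommendation lists, assuming one (ranked list, relevant set) pair per test query or user:

```python
def average_precision(ranked, relevant):
    """Average precision of a single ranked list against a set of relevant items."""
    hits, total = 0, 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / k            # P(k) * rel(k)
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(queries):
    """queries: list of (ranked_list, relevant_set) pairs."""
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)
```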
Receiver Operating Characteristics (ROC)
Summary
Overview: Prediction accuracy answers the question of how well the recommender does at estimating preference
Decision support metrics answers how well the recommender does at finding good things
Rank accuracy metrics look at how well the recommender estimates relative preference
Metric families
Mean Reciprocal Rank (MRR)
Reciprocal rank (RR): 1/i where i is the rank of the first ‘good’ item
RR measures how far you have to go to find something good. MRR is an average over all test queries
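A minimal sketch of RR and MRR, using the same (ranked list, relevant set) layout as the MAP sketch above:

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0 if none is found."""
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / k
    return 0.0

def mean_reciprocal_rank(queries):
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)
```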
Spearman Rank Correlation
This measures the agreement between two rankings of the items. It is essentially Pearson’s correlation computed over ranks. If items are tied at the same rank, they are assigned the average of their ranks
\[ \frac{\sum_{i}(r_1(i)-\mu_1)(r_2(i)-\mu_2)}{\sqrt{\sum_i(r_1(i)-\mu_1)^2}\sqrt{\sum_i(r_2(i)-\mu_2)^2}}\]
where \(r_1(i)\) and \(r_2(i)\) are the actual and predicted ranks of item \(i\)
Spearman’s correlation weighs each item the same irrespective of whether it is at rank 1 or rank 21, which is one of the shortcomings of this method, since errors at the top of the list are more visible than those at the bottom
Discounted Cumulative Gain
This measures the utility of the item at each position in the list, discounts it by position, and normalizes by the total achievable utility
\[DCG(R) = \sum_iu(i).d(i)\] \[nDCG(R) = \frac{DCG(R)}{DCG(R_{perfect})}\]
The utility of the item can be the user rating of the item, or 1 if clicked and 0 otherwise.
The discount is a function that decreases as you go down the list. \[d(i) = \frac{1}{\max(1,\log_2 i)}\]
Another discount is the half-life utility (Breese, 1996) \[d(i) = \frac{1}{2^{\frac{r(i)-1}{\alpha - 1}}}\]
The above metric uses an exponential decay i.e. users are exponentially less likely to click on each successive item
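A minimal sketch of DCG and nDCG using the \(1/\max(1, \log_2 i)\) discount above; the utilities can be ratings or 0/1 click indicators, listed in the order the recommender ranked the items:

```python
from math import log2

def dcg(utilities):
    """utilities: item utilities in ranked order (e.g. ratings, or 1 if clicked else 0)."""
    return sum(u / max(1.0, log2(i)) for i, u in enumerate(utilities, start=1))

def ndcg(utilities):
    """Normalize by the DCG of the best achievable ordering of the same utilities."""
    ideal = dcg(sorted(utilities, reverse=True))
    return dcg(utilities) / ideal if ideal > 0 else 0.0
```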
Fraction of Concordant Pairs
Looks at the fraction of all pairs that it puts in the correct order
Summary
nDCG is increasingly common. MRR is also used.
Coverage represents the percentage of products for which the recommender makes a prediction
Almost always, it is a simple percentage
Directly relevant when dealing with predictions. It says what percentage of movies you can get a star rating score for. It is often used as a background metric in top-N recommenders (i.e. what percentage of items will appear in someone’s top-N)
Diversity measures how different the items recommended are
One can begin with the pairwise similarity metric. This can be used to calculate an intralist similarity that is the average pairwise similarity metric for all pairs in the list. Lower scores imply higher diversity
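A minimal sketch of intra-list similarity, assuming some pairwise item-similarity function item_sim(i, j) is available (e.g. a content or rating-vector similarity; the name is illustrative):

```python
from itertools import combinations

def intra_list_similarity(items, item_sim):
    """Average pairwise similarity over all pairs in a recommendation list.

    Lower values indicate a more diverse list."""
    pairs = list(combinations(items, 2))
    if not pairs:
        return 0.0
    return sum(item_sim(i, j) for i, j in pairs) / len(pairs)
```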
Process of diversification: find the item in the top n that most reduces diversity (i.e. contributes most to intra-list similarity) and replace it with the (n+1)th item. If users prefer a wider set of interests, diversification may be helpful
Alternatives to diversification include:
Business goal: don’t turn away customers by pigeon-holing them into a narrow part of your catalog
Surprise, delight, not the expected result
\[ seren = \frac{1}{N}\sum_{i=1}^Nmax(Pr(s_i)-Prim(s_i),0)*isrel(s_i) \]
where \(Pr(s_i)\) is the predicted score, \(Prim(s_i)\) is a predicted score based on a primitive estimator, and \(isrel(s_i)\) is the relevancy of the prediction
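A minimal sketch of this serendipity measure, assuming the scores, primitive scores, and relevance indicators are lists aligned item by item:

```python
def serendipity(scores, primitive_scores, relevance):
    """Average positive lift over a primitive (e.g. popularity-based) predictor,
    counted only where the item is relevant."""
    n = len(scores)
    return sum(max(p - pp, 0.0) * rel
               for p, pp, rel in zip(scores, primitive_scores, relevance)) / n
```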
In practice, if you downweight items that are highly popular, you will increase serendipity.
This tends to require experimentation and tuning. One may use a diversification factor approach, or a constant down-weighting by popularity
The business goal is to get people to consume the less popular item
Summary: Metrics can capture some fuzzier objectives, such as producing a diverse, delightfully surprising set of recommendations, or going deep into the “long tail” of the product catalog rather than recommending only a few oft-recommended items. Experimental data shows that these may be worth sacrificing some accuracy for.
The goal of offline evaluation is to estimate the recommender’s quality.
Evaluation approach: k-fold cross-validation. Split the data into k parts, hold out one part for evaluation, then repeat and average performance over the held-out parts.
Most published research uses k=5 or 10. Higher k gives more training but requires more evaluations.
Often, instead of splitting the ratings, we split by user. Each test user’s ratings are split into query and test sets: for each fold, a fixed fraction of their ratings is held out as the test set, and the rest (the query set) goes into training. The test ratings can be picked at random (the default in LensKit), or you could use the user’s 10 most recent ratings. We then compute RMSE/MAE/nDCG per user, average over users for the fold, and then average over folds per metric.
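A minimal sketch of the per-user query/test split described above, with randomly chosen test ratings (using the most recent ratings instead would require timestamps; the fraction and seed are illustrative):

```python
import random

def split_user_ratings(user_ratings, test_fraction=0.2, seed=0):
    """Split each user's ratings into query (training) and test sets.

    user_ratings: dict mapping user -> {item: rating}."""
    rng = random.Random(seed)
    train, test = {}, {}
    for u, items in user_ratings.items():
        ids = list(items)
        rng.shuffle(ids)
        n_test = max(1, int(len(ids) * test_fraction))
        test[u] = {i: items[i] for i in ids[:n_test]}
        train[u] = {i: items[i] for i in ids[n_test:]}
    return train, test
```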
Unary data is positive-only (purchase, like, click) - often studied under ‘implicit feedback’ but not exactly the same thing. We do not know the actions that the user evaluated but did not pick.
Metrics
Mitigation Strategies
Synthesize unary data to get negatives (e.g. use ratings data, and recommend from rated items)
Limit the domain of recommendation to, say, …
Another promising strategy is spam/non-spam detection in cases where there is missing user data
Summary
Evaluating recommenders is hard. With unary data it is more so.
Metrics only tell part of the story. What we really want to know is what users prefer, and how users trade off the different qualities that the metrics measure.
Spectrum of techniques
Lab experiments
Disadvantages of user-user filtering
Item-item similarity fairly stable
Item similarity is a route to computing a prediction of a user’s preference for an item
Computation: Two-step process
Item Item top-N
Main limitation: Item-item recommendations lower serendipity
Components
Item similarities
Typically 20 neighbors are picked
Scoring: For each item to score
Item-Item with unary data
purchase count matrix
Use conditional probability
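A minimal sketch of conditional-probability item similarity from unary data, assuming each user's purchases arrive as a set (a "basket"); the names are illustrative:

```python
from collections import defaultdict

def conditional_prob_similarity(baskets):
    """Item-item scores sim[i][j] = P(j | i) = count(i and j) / count(i).

    baskets: iterable of sets, each the items purchased by one user."""
    count = defaultdict(int)
    co_count = defaultdict(lambda: defaultdict(int))
    for basket in baskets:
        for i in basket:
            count[i] += 1
            for j in basket:
                if j != i:
                    co_count[i][j] += 1
    return {i: {j: c / count[i] for j, c in co.items()} for i, co in co_count.items()}
```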
We want to decompose an m×n matrix \(R\) (where m is the number of users and n is the number of items) into \[ R=U{\Sigma}V^T\] where \(U\) is the user-feature matrix, \(V\) is the item-feature matrix, and \(\Sigma\) is a diagonal matrix of weights (the singular values in linear algebra)
We only keep the \(k\) largest singular values and delete the rest. This gives us an m×k user-feature matrix \(U\), an n×k item-feature matrix \(V\), and k diagonal weights. This is the best rank-k approximation of \(R\) by global RMSE
The prediction of item i for user a is given by: \[p_{a,i} = \sum_{f=1}^{k}u_{a,f}{\sigma}_fv_{i,f}\]
For new users, or existing users that rate new items, we can get the user-feature vector as below
\[u_a = {\Sigma}^{-1}V^T{r_a}^T\]
This is referred to as folding in. If a new item is added, we can use a corollary of the above approach. This is done because it is computationally expensive to re-compute the SVD
For missing data, impute means or pre-normalize data
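A minimal numpy sketch of the truncated SVD, prediction, and folding-in steps above, assuming missing ratings have already been imputed (e.g. with item means) so R is a dense m×n matrix:

```python
import numpy as np

def truncated_svd(R, k):
    """Rank-k SVD of a mean-imputed m x n rating matrix R."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def predict_all(U, s, Vt):
    """Reconstruct the full matrix of predicted ratings from the truncated factors."""
    return U @ np.diag(s) @ Vt

def fold_in_user(r_new, s, Vt):
    """Fold a new user's (imputed) rating vector into the factorization:
    u_new = r_new V Sigma^{-1}, avoiding a full recomputation of the SVD."""
    u_new = r_new @ Vt.T @ np.diag(1.0 / s)
    return u_new @ np.diag(s) @ Vt        # predicted ratings for the new user
```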
Funk-SVD is based on a gradient-descent approach used by Simon Funk for the Netflix Prize.
We use gradient descent as follows:
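A minimal sketch of the stochastic gradient update commonly used for Funk-SVD, training one latent feature at a time over the observed ratings only; bias terms are omitted, and the learning rate, regularization, and initialization values are illustrative:

```python
import numpy as np

def funk_svd(ratings, n_users, n_items, k=20, lr=0.005, reg=0.02, epochs=20):
    """Learn user and item feature matrices by gradient descent on observed ratings.

    ratings: list of (user_index, item_index, rating) triples."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    P = np.full((n_users, k), 0.1)           # user features
    Q = np.full((n_items, k), 0.1)           # item features
    for f in range(k):                       # train one feature at a time
        for _ in range(epochs):
            for u, i, r in ratings:
                err = r - (mu + P[u] @ Q[i])
                p_uf, q_if = P[u, f], Q[i, f]
                P[u, f] += lr * (err * q_if - reg * p_uf)
                Q[i, f] += lr * (err * p_uf - reg * q_if)
    return mu, P, Q
```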
** Add notes for the rest of the SVD lesson **
The predicted rating of user a for item i according to the linear model is given by:
\[ r_{a,i} = \mu + b_a + b_i + {\alpha}t_{week}\] where \(b_a\) is the user bias, and \(b_i\) is the item bias, and the last term represents any day-of-the-week bias
According to the Funk-SVD calculation, we have
\[ r_{a,i} = \mu + b_a + b_i + u_a*v_i\]
Thus, the SVD is a version of the linear regression model with latent features
The SVD++ model extends this by encoding user preferences as additional features
\[ r_{a,i} = b_{a,i} + \Big(u_a + \frac{1}{\sqrt{|N(a)|}}\sum_{j{\in}N(a)}y_j\Big) \cdot v_i\]
where \(u_a\) is the user-feature vector, \(N(a)\) is the set of items rated by user \(a\), and \(y_j\) is an implicit-feedback feature vector for item \(j\)
SVD feature generalizes this further
CLiMF optimizes Mean Reciprocal Rank (MRR) of a user’s recommendation
Factorization Machines (combines matrix factorization with support vector machines)
J. Konstan & J. Riedl Deconstructing Recommender Systems, IEEE Spectrum, Sep. 2012
Evan Miller, How not to sort by average rating
Kamba, Bharat, and Albers (WWW ‘95) Krakatoa Chronicle
J. Herlocker, J. Konstan & J. Riedl, Explaining Collaborative Filtering, CSCW 2001.
Michael D. Ekstrand, John T. Riedl, and Joseph A. Konstan, Collaborative Filtering Recommender Systems,Foundations and Trends® in Human-Computer Interaction 4, 2 (February 2011), pp 81–173.
Resnick, P. and Sami, R. 2007. The influence limiter: provably manipulation-resistant recommender systems. In Proceedings of the 2007 ACM Conference on Recommender Systems (Minneapolis, MN, USA, October 19 - 20, 2007). RecSys ’07. ACM, New York, NY, 25-32.