RecommenderSystems20210917

Chapter One: An Introduction to Recommender Systems

ONE: Introduction

-Basic Idea: using different data sources to infer customer interests.

-Basic principle: significant dependencies exist between user- and item-centric activity

\(m \times n\) rating matrix denoted by \(R\).

-User: recommendation; \(m\) is the number of users.

-item: product being recommended; \(n\) is the number of items

\(r_{ij}\): the observed rating of user \(i\) for item \(j\).

Note: similar to response matrix in DCM.

Respondent/User	Item1	Item2	Item3	Item4	Item5	Item6	Item7	Item8
Respondent1/User1	1	0	na	0	na	1	1	na
Respondent2/User2	0	1	1	na	0	na	1	1
Respondent3/User3	na	0	0	1	1	1	0	na
Respondent4/User4	0	1	1	na	0	1	0	na

Note: we have the following\(Q-matrix\) design in DCM, which is pre-defined by content experts. The \(Q-matrix\) can have simple or complex structures.

Item	Attribute1	Attribute2	Attribute3	Attribute4	Attribute5	Attribute6
Item1	1	0	1	1	0	1
Item2	0	1	1	1	0	1
Item3	1	0	0	0	0	1
Item4	0	1	1	0	1	0
Item5	0	1	0	1	1	0
Item6	0	1	1	1	0	1
Item7	1	0	0	0	1	1
Item8	1	1	1	0	1	0

Collaborative filtering: using ratings from multiple users in a collaborative way to predict missing ratings.

DCM: using responses from different respondents to predict attributes and missing attributes.

TWO: Basic Models

a. Collaborative filtering models

Main challenge: underlying ratings matrices are sparse (but can be imputed)

memory-based methods: neighborhood-based algorithms; regression-based models; similarity-based models.

user-based: similarity functions are computed between rows of rating matrix to discover similar users
respondent-based: to discover similar respondents with same mastery profiles.
item-based: Similarity functions are computed between the columns of the ratings matrix to discover similar items.
attribute-based: to discover similar attributes between the columns.
Advantages: simple to implement; easy to explain.
Disadvantages: not working very well with sparse ratings matrices.

model-based methods: predictive parameterized models, such as decision trees, rule-based models, Bayesian methods and latent factor models.

Advantages: working very well with sparse ratings matrices.
Combinations of memory-based and model-based methods provide very accurate results.

Types of Ratings

interval-based: {-2, -1, 0, 1, 2} from extremely dislike to extremely like
ordinal-based: {1, 2, 3, 4, 5} from poor to excellent.
binary: {0, 1} frequently used in psychometric models.
unary: {1} to specify a liking but no disliking. not recommended

	Movie1	Movie2	Movie3	Movie4	Movie5	Movie6
User1	1			1		1
User2		1			1
User3	1	1		1
User4			1			1
User5				1	1
User6	1		1

Figure 1.3: Examples of utility matrices: Unary rating (Aggarwal, 2016; p.12)

utility matrix: utility refers to the amount of profit incurred by recommending that item to the particular user.
unique attribute matrix: the most difficult item correctly responded by the particular respondent, which gives the respondent a unique mastery profile??.
Note: substitution of missing entries with any value leads to a significant amount of bias

Aggarwal, 2016; p. 13

Collaborative filtering models with missing value analysis

Collaborative filtering models are closely related to missing value analysis.
Collaborative filtering as a generalization of classification and regression modeling.

b. Content-Based Recommender Systems

c. Knowledge-Based Recommender Systems

d. Demographic Recommender Systems

e. Hybrid and Ensemble-Based Recommender Systems

Aggarwal, 2016; p. 13

f. Evaluation -orthogonal to each other

Accuracy
Coverage
Confidence and Trust
Novelty 新颖度
Serendipity 惊喜度
Diversity 多样化
Robustness and Stability
Scalability
Interpretability (my thoughts)

THREE: Domain-Specific Challenges

a. Context-Sensitive (multidimensional?? multi-level??)
b. Time-Sensitive(process data??)
c. Location-Based (online vs offline??)
d. Social

FOUR: Advanced Topics and Applications

FIVE: Summary

Chapter Two: Neighborhood-Based Collaborative Filtering

ONE: Introduction (p. 29-p.31)

Neighborhood-based collaborative filtering algorithms : memory-based algorithm

user-based: the ratings provided by similar users to a target user A to make recommendations for A. The predicted ratings of A are computed as the weighted average values of these “peer group” ratings for each item.
respondent-based: the assumption here: respondents respond to each item independently, no dependency. Independent and identically distributed (IID).
item-based: to determine a set S of similar items to target item B. The weighted average of these ratings is used to compute the predicted rating of user A for item B.
item-based: assumption: IID for items..
Conclusion: Different assumptions here and in educational assessments: We assume conditional independence or local independence. i.e., the responses to an item are independent of the responses to any other item conditional on the respondent’s ability\(\theta\). In the cases of interdependency, additional latent variables such as gender, ethnicity, speededness,etc. are needed.

TWO: Key Properties of Rating Matrices *(p. 31-p.33)

Continuous ratings: any values
Interval-based ratings: 5-point or N-point scales, equidistant.
Ordinal rating: similar to interval-based, also ordered categorical; difference between pair of adjacent ratings values.
Binary ratings: two options (0/1) (M/F)
Unary rating: one option only

THREE: Predicting Ratings with Neighborhood-Based Methods (p. 33-p.45)

1. User-based Neighborhood Models: similarity functions are computed between rows of rating matrix to discover similar users.

respondent-based: IID assumptions.
Step One: Compute the mean rating \(\mu_u\) for each user \(u\) using her specified ratings:
\(\mu_{u}=\frac{\sum_{k \in I_{u}} r_{u k}}{\left|I_{u}\right|} \quad \forall u \in\{1 \ldots m\}\) (2.1; p. 35)
Cosine function on the raw ratings: is it the right one?? (2.5, p. 37)
\(\operatorname{Raw} \operatorname{Cosine}(u, v)=\frac{\sum_{k \in I_{u} \cap I_{v}} r_{u k} \cdot r_{v k}}{\sqrt{\sum_{k \in I_{u} \cap I_{v}} r_{u k}^{2}} \cdot \sqrt{\sum_{k \in I_{u} \cap I_{v}} r_{v k}^{2}}}\)
Step Two: Pearson correlation coefficient between rows(users) \(u\) and \(v\):
\(\operatorname{Sim}(u, v)=\operatorname{Pearson}(u, v)=\frac{\sum_{k \in I_{u} \cap I_{v}}\left(r_{u k}-\mu_{u}\right) \cdot\left(r_{v k}-\mu_{v}\right)}{\sqrt{\sum_{k \in I_{u} \cap I_{v}}\left(r_{u k}-\mu_{u}\right)^{2}} \cdot \sqrt{\sum_{k \in I_{u} \cap I_{v}}\left(r_{v k}-\mu_{v}\right)^{2}}}\) (2.2; p. 35)

covariance by the product of the two variables’ standard deviations.
The Pearson correlation coefficient is preferable to the raw cosine because of the bias adjustment effect of meaning-centering.
Step Three: mean-centered rating \(s_{uj}\):
\(s_{u j}=r_{u j}-\mu_{u} \quad \forall u \in\{1 \ldots m\}\) (2.3; p. 35)
Step Four: overall neighborhood-based prediction function (p.36):

\(\hat{r}_{u j}=\mu_{u}+\frac{\sum_{v \in P_{u}(j)} \operatorname{Sim}(u, v) \cdot s_{v j}}{\sum_{v \in P_{u}(j)}|\operatorname{Sim}(u, v)|}=\mu_{u}+\frac{\sum_{v \in P_{u}(j)} \operatorname{Sim}(u, v) \cdot\left(r_{v j}-\mu_{v}\right)}{\sum_{v \in P_{u}(j)}|\operatorname{Sim}(u, v)|}\) (2.4; p.36)

Aggarwal, 2016; p. 34 Example: (p.36)

Similarity function variants
variants of the prediction function
Variations in Filtering peer groups
impact of the long tail

Aggarwal, 2016; p. 33

2. Item-based Neighborhood Models: Similarity functions are computed between the columns of the ratings matrix to discover similar items. (p.40)

item-based: IID assumptions.
The adjusted cosine similarity between the items (columns) \(i\) and \(j\) is defined as follows:
\(\operatorname{Adjusted} \operatorname{Cosine}(i, j)=\frac{\sum_{u \in U_{i} \cap U_{j}} s_{u i} \cdot s_{u j}}{\sqrt{\sum_{u \in U_{i} \cap U_{j}} s_{u i}^{2}} \cdot \sqrt{\sum_{u \in U_{i} \cap U_{j}} s_{u j}^{2}}}\) (2.14, p.40)
The predicted rating \(\hat{r}_{ut}\) of user \(u\) for target item \(t\) is as follows:
\(\hat{r}_{u t}=\frac{\sum_{j \in Q_{t}(u)} \operatorname{Adjusted} \operatorname{Cosine}(j, t) \cdot r_{u j}}{\sum_{j \in Q_{t}(u)}|\operatorname{Adjusted} \operatorname{Cosine}(j, t)|}\) (2.15, p.40)

Example (p.40- p.41)

3. Efficient Implementation and Computational Complexity

Neighborhood-based methods is used to determine the best item recommendations for a target user or the best user recommendations for a target item.
We have more users than the number of items in real situations.

4. Comparing User-Based and Item-Based Methods (p.42-p.44)

Item-based Neighborhood Models: more relevant recommendations, better accuracy, more robust, concrete reasons for a recommendation, more stable with changes to the ratings. How about diversity, serendipity, novelty, etc.
User-based Neighborhood Models: Different explanations.

Aggarwal, 2016; p. 43 5. Strengths and Weaknesses of Neighborhood-Based Methods (p.44)

simple and intuitive;
easy to implement and debug;
easy to justify;
not easily available in model-based methods
impractical in large-scale settings.

6. Unified View

FOUR: Clustering and Neighborhood-Based Methods (p. 45-p.47)

Main idea of clustering-based methods: replace offline nearest-neighbor computation phase with clustering.
the trade-off between accuracy and efficiency.
the rating matrix is incomplete. massive incomplete data sets.
clusters are dependent on the representatives.
purpose of clustering here: to reduce the computational complexity. It divided the original matrices into smaller matrices.

FIVE: Dimensionality Reduction and Neighborhood-Based Methods **(p. 47-p.48)

Dimensionality Reduction: improve neighborhood-based methods in quality and efficiency. (p.47)
Dimensionality Reduction: latent factor models.
singular value decomposition (SVD)-like methods. Netflix??? SVD for large-scale recommendations??? reduce the rank???. An approach for dimensionality reduction.
principal component analysis (PCA)-like methods.
Bias (p.49-p.51)
Bias: caused by filling the unspecified entries with average values along either the rows or columns.
Maximum Likelihood Estimation: covariance matrix is computed.

Aggarwal, 2016; p.50

Direct matrix factorization of incomplete data (p.50)

When the matrix is sparse, the covariance estimates will be statistically unreliable. Matrix factorization methods such as singular value decomposition (SVD) are suggested.

One can approximately factorize the matrix by using truncated SVD, which can be computed as follows:

\(R=Q\Sigma P^T\) (2.18; p.51)

SIX: Regression Modeling View of Neighborhood-Based Methods (p. 51-p.60)

1. User-Based Nearest Neighbor Regression (p.53)

The predicted rating of target user \(u\) for item \(j\) is as follows:
\(\hat{r}_{u j}=\mu_{u}+\frac{\sum_{v \in P_{u}(j)} \operatorname{Sim}(u, v) \cdot\left(r_{v j}-\mu_{v}\right)}{\sum_{v \in P_{u}(j)}|\operatorname{Sim}(u, v)|}\) (2.22; p. 53)
The least-squares objective function is as follows:
\(\begin{aligned} \text { Minimize } J_{u} &=\sum_{j \in I_{u}}\left(r_{u j}-\hat{r}_{u j}\right)^{2} \\ &=\sum_{j \in I_{u}}\left(r_{u j}-\left[\mu_{u}+\sum_{v \in P_{u}(j)} w_{v u}^{u s e r} \cdot\left(r_{v j}-\mu_{v}\right)\right]\right)^{2} \end{aligned}\) (p.55)
The consolidated optimization problem is as follows:
\(\operatorname{Minimize} \sum_{u=1}^{m} J_{u}=\sum_{u=1}^{m} \sum_{j \in I_{u}}\left(r_{u j}-\left[\mu_{u}+\sum_{v \in P_{u}(j)} w_{v u}^{u s e r} \cdot\left(r_{v j}-\mu_{v}\right)\right]\right)^{2}\) (2.23, p.54)
The User-Based Nearest Neighbor Regression Model: (2.26; p.55)
\(\hat{r}_{u j}=b_{u}^{\text {user }}+b_{j}^{i t e m}+\frac{\sum_{v \in P_{u}(j)} w_{v u}^{\text {user }} \cdot\left(r_{v j}-b_{v}^{\text {user }}-b_{j}^{\text {item }}\right)}{\sqrt{\left|P_{u}(j)\right|}}\) (2.26; p.55)
sparsity and Bias

2. Item-Based Nearest Neighbor Regression (p.55)

The predicted rating of target user \(u\) for item \(t\) is as follows:
\(\hat{r}_{u t}=\sum_{j \in Q_{t}(u)} w_{j t}^{i t e m} \cdot r_{u j}\) (2.27, p.55)
The least-squares objective function is as follows:
\(\begin{aligned} \text { Minimize } J_{t} &=\sum_{u \in U_{t}}\left(r_{u t}-\hat{r}_{u t}\right)^{2} \\ &=\sum_{u \in U_{t}}\left(r_{u t}-\sum_{j \in Q_{t}(u)} w_{j t}^{i t e m} \cdot r_{u j}\right)^{2} \end{aligned}\) (p.56)
The consolidated optimization problem is as follows:
\(\operatorname{Minimize} \sum_{t=1}^{n} \sum_{u \in U_{t}}\left(r_{u t}-\sum_{j \in Q_{t}(u)} w_{j t}^{i t e m} \cdot r_{u j}\right)^{2}\) (2.28; p.56)
The Item-Based Nearest Neighbor Regression Model: (2.30; p.55)
\(\hat{r}_{u t}=b_{u}^{u s e r}+b_{t}^{i t e m}+\frac{\sum_{j \in Q_{t}(u)} w_{j t}^{i t e m} \cdot\left(r_{u j}-B_{u j}\right)}{\sqrt{\left|Q_{t}(u)\right|}}\) (2.30; p.56)

3. Combing Two Methods (p.57)

The combined model can be expressed as follows:
\(\hat{r}_{u j}=b_{u}^{u s e r}+b_{j}^{i t e m}+\frac{\sum_{v \in P_{u}(j)} w_{v u}^{\text {user }} \cdot\left(r_{v j}-B_{v j}\right)}{\sqrt{\left|P_{u}(j)\right|}}+\frac{\sum_{j \in Q_{t}(u)} w_{j t}^{i t e m} \cdot\left(r_{u j}-B_{u j}\right)}{\sqrt{\left|Q_{t}(u)\right|}}\) (2.31, p.57)

4. Joint Interpolation with Similarity Weighting (p.57)

The objective function:
\(\begin{aligned} \text { Minimize } & \sum_{s:(u, s) \in S} \sum_{j: j \neq s} \operatorname{AdjustedCosine}(j, s) \cdot\left(r_{u s}-\hat{r}_{u j}\right)^{2} \\ &=\sum_{s:(u, s) \in S j: j \neq s} \sum_{j d j u s t e d C o s i n e}(j, s) \cdot\left(r_{u s}-\left[\mu_{u}+\sum_{v \in P_{u}(j)} w_{v u}^{u s e r} \cdot\left(r_{v j}-\mu_{v}\right)\right]\right)^{2} \end{aligned}\) (p.57)

5. Sparse Linear Models (SLIM) (p.58)

the prediction function in SLIM is:
\(\hat{r}_{u t}=\sum^{n} w_{j t}^{i t e m} \cdot r_{u j} \forall u \in\{1 \ldots m\}, \forall t \in\{1 \ldots n\}\) (2.33; p.58)
the objective function is:
\(\begin{aligned} \text { Minimize } J_{t}^{s} &=\sum_{u=1}^{m}\left(r_{u t}-\hat{r}_{u t}\right)^{2}+\lambda \cdot \sum_{j=1}^{n}\left(w_{j t}^{i t e m}\right)^{2}+\lambda_{1} \cdot \sum_{j=1}^{n}\left|w_{j t}^{i t e m}\right| \\ &=\sum_{u=1}^{m}\left(r_{u t}-\sum_{j=1}^{n} w_{j t}^{i t e m} \cdot r_{u j}\right)^{2}+\lambda \cdot \sum_{j=1}^{n}\left(w_{j t}^{i t e m}\right)^{2}+\lambda_{1} \cdot \sum_{j=1}^{n}\left|w_{j t}^{i t e m}\right| \\ \text { subject to: } \end{aligned}\) \[ \begin{aligned} &w_{j t}^{i t e m} \geq 0 \forall j \in\{1 \ldots n\} \\ &w_{t t}^{\text {item }}=0 \end{aligned} \] (p.59)

SEVEN: Graph Models for Neighborhood-Based Methods **(p. 60-p.67)

1. User-Item Graphs (p.61-p.63) Aggarwal, 2016; p. 62

2. User-User Graphs (p.63-p.65)

Aggarwal, 2016; p. 64

3. Item-Item Graphs (p.66-p.67)

Aggarwal, 2016; p. 66

EIGHT: Summary(p. 67)

Chapter Three: Model-Based Collaborative Filtering (p.71-p.138)

ONE: Introduction (p.71- p.74)

Recall CH2 The neighbor-based methods: generalizations of \(k-\)nearest neighbor classifiers, in which the prediction approach is specific to the instance being predicted. For example, in user-based neighborhood methods, the peers of the target user are determined in order to perform the prediction.

Related neighbor-based methods: User-based, item-based, unified, clustering, dimensionality reduction, nearest neighbor regression.

Model-based methods: the training (or model-building phase) is clearly separated from the prediction phase. (p.71)

Related Model-based methods: decision trees, rule-based methods, Bayes classifier, regression models, support vector machine (SVM), and neural networks. (p.71)

Use some of them for classification tasks in educational assessments??.

Aggarwal, 2016; p. 72

Differences between classification tasks and matrix completion tasks (collaborative filtering) (p.72-p.73)

1. Classification: a. clear separation between feature (independent) variable and class (dependent) variables; b. clear separation between training and test data; c. columns-features, rows-data instances.
1. Matrix completion (generalization of the classification or regression modeling): a. no clear separation; b. no clear separation; c. transpose is possible (user-wise or item-wise) in collaborative filtering. it has greater generality, leading to a richer number of algorithmic possibilities.

Advantages of Model-based recommender systems over neighborhood-based methods (p.73)

Space-efficiency.
Training and prediction speed.

TWO: Decision and Regression Trees (p.74-p.77)

Decision and regression trees: classification tasks.
Decision trees: DVs are categorical. It is a hierarchical partitioning of the data space with the use of a set of hierarchical decisions.
Gini index G(S) of the node:

\(G(S)=1-\sum_{i=1}^{r} p_{i}^{2}\) (3.1, p.74)

which lies between 0 and 1, with smaller values being more indicative of greater discriminative power.

Gini index of the split S (\(S_1, S_2\)) may be evaluated as:

\(\operatorname{Gini}\left(S \Rightarrow\left[S_{1}, S_{2}\right]\right)=\frac{n_{1} \cdot G\left(S_{1}\right)+n_{2} \cdot G\left(S_{2}\right)}{n_{1}+n_{2}}\) (3.2, p.74)

Aggarwal, 2016; p. 75

Here, the attribute with smallest Gini index is selected for the split.

1. Extending Decision Tree to Collaborative Filtering. (p.76-p.77)

The number of decision trees constructed is exactly equal to the number n of attributes (items). (p.76)
Dimensionality reduction methods: lower-dimensional representation of data. {style=“color:red”} (p.77)
Regression tress: DVs are numerical

THREE: Rule-Based Collaborative Filtering (p.77-p.81)

Association rules are naturally defined over binary data. Here unary data sets are used.

Aggarwal, 2016; p. 78

The key in association rule mining is to determine sets of items that are closely correlated in teh transaction database by Support (s) and Confidence (c).
Association Rules: (p.78).

Aggarwal, 2016; p. 78

Comments:Unary transection data is used here, not related to educational assessment (omitted)

FOUR: Naive Bayes Collaborative Filtering (p.77-p.81)

The naive Bayes model is a generative model for classification, which is expressed as (3.4, p.82):

\(P\left(r_{u j}=v_{s} \mid\right.\) Observed ratings in \(\left.I_{u}\right)=\frac{P\left(r_{u j}=v_{s}\right) \cdot P\left(\text { Observed ratings in } I_{u} \mid r_{u j}=v_{s}\right)}{P\left(\text { Observed ratings in } I_{u}\right)}\)

which is also expressed as (3.5, p.82):

\[ P\left(r_{u j}=v_{s} \mid \text { Observed ratings in } I_{u}\right) \propto P\left(r_{u j}=v_{s}\right) \cdot P\left(\text { Observed ratings in } I_{u} \mid r_{u j}=v_{s}\right) \]

Here the conditional probability of (3.5) on the right side can be expressed as:

\[ P\left(\text { Observed ratings in } I_{u} \mid r_{u j}=v_{s}\right)=\prod_{k \in I_{u}} P\left(r_{u k} \mid r_{u j}=v_{s}\right) \] - The value of \(P\left(r_{u k} \mid r_{u j}=v_{s}\right)\) is estimated as the fraction of users that have specified the rating of \(r_{uk}\) for the \(k^{th}\) item, given they have specified the rating of their \(j^{th}\) item to \(v_s\).

The posterior probability of the rating of item \(j\) for user \(u\) is as follows: (3.7, p.83)

\[ P\left(r_{u j}=v_{s} \mid \text { Observed ratings in } I_{u}\right) \propto P\left(r_{u j}=v_{s}\right) \cdot \prod_{k \in I_{u}} P\left(r_{u k} \mid r_{u j}=v_{s}\right) \] - *Example with Binary Ratings** (p.85)

**Comments: it is commonly used in educational assessment.

FIVE: Using an Arbitrary Classification Model as a Black-Box (p.86-p.90)

An off-the-shelf classification algorithm can be used as a black-box to predict the ratings of one of the items with the ratings of other items. (p.86)
Example: Using a Neural Network as a Black-Box. (p.88-p90)
Artificial neural networks (ANNs): basic computation unit (neurons), weights (strengths of synaptic 突触 connections); basic architecture (perceptron 感知器-a set of input nodes and an output node) (p.87-p.88). In general, an ANN can have multiple layers, and intermediate nodes can compute nonlinear functions (Figure 3.3(b)) (p.89).
Advantages of ANNs: to compute complex nonlinear functions, for complex classification tasks. Regularization can be used to reduce the impact of noise.
Recommendations: only users with valid ratings of items are used; users should not all be highly similar to one another (preselect muturally diverse users in the initial phase)

Comments: Maybe I can try to use ANNs for complex classification tasks in educational assessment???

Aggarwal, 2016; p. 88 - The signed linear function for binary output is as follows: (3.9, p.88)

\(z_{i}=\operatorname{sign}\left\{\bar{W} \cdot \overline{X_{i}}+b\right\}\)

The updated function for the neural networks is as follows: (3.10, p.89)

\(\bar{W}^{t+1}=\bar{W}^{t}+\alpha\left(y_{i}-z_{i}\right) \overline{X_{i}}\)

\(\alpha > 0\) is the learning rate
\(\bar{W}^{t}\) is the value of weight vector in the \(t^{th}\) iteration

SIX: Latent Factor Models 重点！！！！！ (p.90-p.113)

Latent factor models: is state-of-the-art (SOTA) in collaborative filtering in recommender systems (p.91).
The family of matrix factorization methods (p.127)

Aggarwal, 2016; p. 127 1. Geometric Intuition for Latent Factor Models (p. 91-p.92)

Dimension Reduction Methods: Principal component Analysis (PCA); (Mean-centered) Singular Value Decomposition (SVD)-leverage the inter-attribute correlations and redundancies to infer unspecified entries (Comments: sparse data or missing data??).
We must use a partially specified data set to recover the low-dimensional hyperplane on which the data approximately lies (p.93)
Why? p.92: to implicitly capture the underlying redundancies in the correlation structure of the data and reconstruct all the missing values in one shot. If the data does not have any correlations or redundancies, a latent factor model will simply not work.

2. Low-Rank Intuition for Latent Factor Models (p. 93-p.92)

Factorization: approximating a matrix when it is prone to dimensionality reduction because of correlations between columns (or rows), so most dimensionality reduction methods are expressed as matrix factorizations.

3. Basic Matrix Factorization Principles (p. 93-p.92)

Aggarwal, 2016; p. 95 - The \(m \times N\) rating matrix \(R\) is approximately factorized into an \(m\times k\)matrix \(U\) and an \(n \times K\) matrix \(V\) is as follows:

\(R \approx U V^{T}\) (3.13, p.94)

Each rating \(r_{ij}\) in \(R\) can be expressed as:

\(r_{i j} \approx \overline{u_{i}} \cdot \overline{v_{j}}\) (3.14, p.94)

An intuitive way of Equation 3.14 is

\(\begin{aligned} r_{i j} & \approx \sum_{s=1}^{k} u_{i s} \cdot v_{j s} \\ &=\sum_{s=1}^{k}(\text { Affinity of user } i \text { to concept s) } \times(\text { Affinity of item } j \text { to concept s })\end{aligned}\)

For Figure 3.7, it can be expressed as:

\[ \begin{aligned} r_{i j} \approx &(\text { Affinity of user } i \text { to history }) \times(\text { Affinity of item } j \text { to history }) \\ &+(\text { Affinity of user } i \text { to romance }) \times(\text { Affinity of item } j \text { to romance }) \end{aligned} \]

[The differences of constraints on \(U\) and \(V\), and the nature of objective functions play a key role in the usability of the matrix factorization model in real-world scenarios.] {style=“color:red”}(p.96)

4. Unconstrained Matrix Factorization (a lot of details) (p.96-p.113)

The most fundamental form of matrix factorization is the unconstrained case–no constraints on \(U\) and \(V\).

Matrix Norm

Singular value decomposition (SVD): columns of \(U\) and \(V\) are orthogonal.
An optimization problem regarding \(U\) and \(V\) of its objective function is formulated as:

Minimize \(J=\frac{1}{2}\left\|R-U V^{T}\right\|^{2}\) subject to: No constraints on \(U\) and \(V\)

The set \(S\) of observed user-item pairs is defined as follows: (3.15, p.97)

\[ S=\left\{(i, j): r_{i j} \text { is observed }\right\} \]

The modified objective function for incomplete matrices is as follows:

Minimize \(J=\frac{1}{2} \sum_{(i, j) \in S} e_{i j}^{2}=\frac{1}{2} \sum_{(i, j) \in S}\left(r_{i j}-\sum_{s=1}^{k} u_{i s} \cdot v_{j s}\right)^{2}\) subject to: No constraints on \(U\) and \(V\)

Gradient descent: (p.98)

Aggarwal, 2016; p. 98 - The partial derivative of \(J\) to \(u_{jq}\) and \(v_{jq}\):

Aggarwal, 2016; p. 98

The updates to be convergent: (batch update method, p99)

\(U \Leftarrow U+\alpha E V\) \(V \Leftarrow V+\alpha E^{T} U\)

Stochastic Gradient Descent (p.99)

Stochastic Gradient Descent is used for optimization, efficient, but sensitive, not very stable.

\[ \begin{aligned} &u_{i q} \Leftarrow u_{i q}-\alpha \cdot\left[\frac{\partial J}{\partial u_{i q}}\right]_{\text {Portion contributed by }(i, j)} \quad \forall q \in\{1 \ldots k\} \\ &v_{j q} \Leftarrow v_{j q}-\alpha \cdot\left[\frac{\partial J}{\partial v_{j q}}\right]_{\text {Portion contributed by }(i, j)} & \forall q \in\{1 \ldots k\} \end{aligned} \] - The 2-K updates *specific to the observed entry \((i, j)\in S\) are:

\[ \begin{array}{ll} u_{i q} \Leftarrow u_{i q}+\alpha \cdot e_{i j} \cdot v_{j q} & \forall q \in\{1 \ldots k\} \\ v_{j q} & \Leftarrow v_{j q}+\alpha \cdot e_{i j} \cdot u_{i q} & \forall q \in\{1 \ldots k\} \end{array} \] - K-dimensional vectorized form:

\[ \begin{aligned} &\overline{u_{i}} \Leftarrow \overline{u_{i}}+\alpha e_{i j} \overline{v_{j}} \\ &\overline{v_{j}} \Leftarrow \overline{v_{j}}+\alpha e_{i j} \overline{u_{i}} \end{aligned} \] - Regularization (p.100-p.103)

Regularization: to reduce the tendency of the model to overfit at the expense of introducing a bias.

Incremental Latent Component Training (p.103-104)

Aggarwal, 2016; p. 100 - Alternating Least Squares (ALS) and Coordinate Descent (p.105)

Alternating Least Squares and Coordinate Descent is also used for optimization, and it is more stable comparing to Stochastic Gradient Descent, but it is not very efficient.

Incorporating User and Item Biases (p.106)

Incorporating User and Item Biases is a variation of unconstrained model. Using only the bias variables can often provide reasonable good rating predictions (p.109).

Aggarwal, 2016; p. 108 - Incorporating Implicit Feedback (p.109-p.113)

Asymmetric factor models and SVD++ have been proposed to incorporate implicit feedback.

5. Singular Value Decomposition (p.113-p.119)

It is a form of matrix factorization in which the columns of \(U\) and \(V\) are constrainted to be mutually orthogonal.
Truncated SVD is computed as:

\(R \approx Q_{k} \Sigma_{k} P_{k}^{T}\)

The user factors and item factors are defined as (3.22, p.113):

\(U=Q_{k} \Sigma_{k}\) \(V=P_{k}\)

SVD is formulated as the following optimization problem:

Minimize \(J=\frac{1}{2}\left\|R-U V^{T}\right\|^{2}\) subject to: Columns of \(U\) are mutually orthogonal Columns of \(V\) are mutually orthogonal

Example of SVD (p.117-119)

6. Non-negative Matrix Factorization (p.119-p.125)

It is used for ratings matrices that are non-negative. This approach is not necessarily one of accuracy, but of high level of interpretability.

Aggarwal, 2016; p. 122

7. Understanding the Matrix Factorization Family (p.126-p.128)

The goal of objective function/loss function is to make \(UV^{T}\) approximate the ratings matrix \(R\) as closely as possible.
The Matrix Factorization Family can be written as follows:

Aggarwal, 2016; p. 126

SIX: Latent Factor Models 重点结束！！！！！ (p.90-p.113)

SEVEN: Integrating Factorization and Neighborhood Models (p.128-p.134)

1. Baseline Estimator: A Non-Personalized Bias-Centric Model (p.128)
1. Neighborhood Portion of Model(p.129)
1. Latent Factor Portion of Model (p.130)
1. Integrating the Neighborhood and Latent Factor Portions (p.131)
1. Solving the Optimization Model (p.131-p.132)
1. Accuracy (p.132-p.133)
1. Integrating Latent factor Models with Arbitrary Models (p.133)

EIGHT: Summary (p.134)

The collaborative filtering problem can be viewed as a generalization of the problem of classification.
Latent factor models are highly tailored to the collaborative filtering problem with different types of factorization using to predict rating.
Issues to consider: objective functions; constraints on basis matrices; trade-offs in accuracy, overfitting, and intepretability.
Latent factor models are the state-of-the-art (SOTA) in collaborative filtering. It can be combined with neighborhood methods to be more powerful.