-Basic Idea: using different data sources to infer customer interests.
-Basic principle: significant dependencies exist between user- and item-centric activity
-User: recommendation; \(m\) is the number of users.
-item: product being recommended; \(n\) is the number of items
\(r_{ij}\): the observed rating of user \(i\) for item \(j\).
Note: similar to response matrix in DCM.
| Respondent/User | Item1 | Item2 | Item3 | Item4 | Item5 | Item6 | Item7 | Item8 |
|---|---|---|---|---|---|---|---|---|
| Respondent1/User1 | 1 | 0 | na | 0 | na | 1 | 1 | na |
| Respondent2/User2 | 0 | 1 | 1 | na | 0 | na | 1 | 1 |
| Respondent3/User3 | na | 0 | 0 | 1 | 1 | 1 | 0 | na |
| Respondent4/User4 | 0 | 1 | 1 | na | 0 | 1 | 0 | na |
Note: we have the following\(Q-matrix\) design in DCM, which is pre-defined by content experts. The \(Q-matrix\) can have simple or complex structures.
| Item | Attribute1 | Attribute2 | Attribute3 | Attribute4 | Attribute5 | Attribute6 |
|---|---|---|---|---|---|---|
| Item1 | 1 | 0 | 1 | 1 | 0 | 1 |
| Item2 | 0 | 1 | 1 | 1 | 0 | 1 |
| Item3 | 1 | 0 | 0 | 0 | 0 | 1 |
| Item4 | 0 | 1 | 1 | 0 | 1 | 0 |
| Item5 | 0 | 1 | 0 | 1 | 1 | 0 |
| Item6 | 0 | 1 | 1 | 1 | 0 | 1 |
| Item7 | 1 | 0 | 0 | 0 | 1 | 1 |
| Item8 | 1 | 1 | 1 | 0 | 1 | 0 |
Collaborative filtering: using ratings from multiple users in a collaborative way to predict missing ratings.
DCM: using responses from different respondents to predict attributes and missing attributes.
Main challenge: underlying ratings matrices are sparse (but can be imputed)
user-based: similarity functions are computed between rows of rating matrix to discover similar users
respondent-based: to discover similar respondents with same mastery profiles.
item-based: Similarity functions are computed between the columns of the ratings matrix to discover similar items.
attribute-based: to discover similar attributes between the columns.
Advantages: simple to implement; easy to explain.
Disadvantages: not working very well with sparse ratings matrices.
Advantages: working very well with sparse ratings matrices.
Combinations of memory-based and model-based methods provide very accurate results.
Types of Ratings
interval-based: {-2, -1, 0, 1, 2} from extremely dislike to extremely like
ordinal-based: {1, 2, 3, 4, 5} from poor to excellent.
binary: {0, 1} frequently used in psychometric models.
unary: {1} to specify a liking but no disliking. not recommended
| Movie1 | Movie2 | Movie3 | Movie4 | Movie5 | Movie6 | |
|---|---|---|---|---|---|---|
| User1 | 1 | 1 | 1 | |||
| User2 | 1 | 1 | ||||
| User3 | 1 | 1 | 1 | |||
| User4 | 1 | 1 | ||||
| User5 | 1 | 1 | ||||
| User6 | 1 | 1 |
Figure 1.3: Examples of utility matrices: Unary rating (Aggarwal, 2016; p.12)
utility matrix: utility refers to the amount of profit incurred by recommending that item to the particular user.
unique attribute matrix: the most difficult item correctly responded by the particular respondent, which gives the respondent a unique mastery profile??.
Note: substitution of missing entries with any value leads to a significant amount of bias
Aggarwal, 2016; p. 13
Collaborative filtering models are closely related to missing value analysis.
Collaborative filtering as a generalization of classification and regression modeling.
Aggarwal, 2016; p. 13
Accuracy
Coverage
Confidence and Trust
Novelty 新颖度
Serendipity 惊喜度
Diversity 多样化
Robustness and Stability
Scalability
Interpretability (my thoughts)
a. Context-Sensitive (multidimensional?? multi-level??)
b. Time-Sensitive(process data??)
c. Location-Based (online vs offline??)
d. Social
Neighborhood-based collaborative filtering algorithms : memory-based algorithm
user-based: the ratings provided by similar users to a target user A to make recommendations for A. The predicted ratings of A are computed as the weighted average values of these “peer group” ratings for each item.
respondent-based: the assumption here: respondents respond to each item independently, no dependency. Independent and identically distributed (IID).
item-based: to determine a set S of similar items to target item B. The weighted average of these ratings is used to compute the predicted rating of user A for item B.
item-based: assumption: IID for items..
Conclusion: Different assumptions here and in educational assessments: We assume conditional independence or local independence. i.e., the responses to an item are independent of the responses to any other item conditional on the respondent’s ability\(\theta\). In the cases of interdependency, additional latent variables such as gender, ethnicity, speededness,etc. are needed.
Continuous ratings: any values
Interval-based ratings: 5-point or N-point scales, equidistant.
Ordinal rating: similar to interval-based, also ordered categorical; difference between pair of adjacent ratings values.
Binary ratings: two options (0/1) (M/F)
Unary rating: one option only
1. User-based Neighborhood Models: similarity functions are computed between rows of rating matrix to discover similar users.
respondent-based: IID assumptions.
Step One: Compute the mean rating \(\mu_u\) for each user \(u\) using her specified ratings:
\(\mu_{u}=\frac{\sum_{k \in I_{u}} r_{u k}}{\left|I_{u}\right|} \quad \forall u \in\{1 \ldots m\}\) (2.1; p. 35)
Cosine function on the raw ratings: is it the right one?? (2.5, p. 37)
\(\operatorname{Raw} \operatorname{Cosine}(u, v)=\frac{\sum_{k \in I_{u} \cap I_{v}} r_{u k} \cdot r_{v k}}{\sqrt{\sum_{k \in I_{u} \cap I_{v}} r_{u k}^{2}} \cdot \sqrt{\sum_{k \in I_{u} \cap I_{v}} r_{v k}^{2}}}\)
Step Two: Pearson correlation coefficient between rows(users) \(u\) and \(v\):
\(\operatorname{Sim}(u, v)=\operatorname{Pearson}(u, v)=\frac{\sum_{k \in I_{u} \cap I_{v}}\left(r_{u k}-\mu_{u}\right) \cdot\left(r_{v k}-\mu_{v}\right)}{\sqrt{\sum_{k \in I_{u} \cap I_{v}}\left(r_{u k}-\mu_{u}\right)^{2}} \cdot \sqrt{\sum_{k \in I_{u} \cap I_{v}}\left(r_{v k}-\mu_{v}\right)^{2}}}\) (2.2; p. 35)
covariance by the product of the two variables’ standard deviations.
The Pearson correlation coefficient is preferable to the raw cosine because of the bias adjustment effect of meaning-centering.
Step Three: mean-centered rating \(s_{uj}\):
\(s_{u j}=r_{u j}-\mu_{u} \quad \forall u \in\{1 \ldots m\}\) (2.3; p. 35)
Step Four: overall neighborhood-based prediction function (p.36):
\(\hat{r}_{u j}=\mu_{u}+\frac{\sum_{v \in P_{u}(j)} \operatorname{Sim}(u, v) \cdot s_{v j}}{\sum_{v \in P_{u}(j)}|\operatorname{Sim}(u, v)|}=\mu_{u}+\frac{\sum_{v \in P_{u}(j)} \operatorname{Sim}(u, v) \cdot\left(r_{v j}-\mu_{v}\right)}{\sum_{v \in P_{u}(j)}|\operatorname{Sim}(u, v)|}\) (2.4; p.36)
Example: (p.36)
Similarity function variants
variants of the prediction function
Variations in Filtering peer groups
impact of the long tail
Aggarwal, 2016; p. 33
2. Item-based Neighborhood Models: Similarity functions are computed between the columns of the ratings matrix to discover similar items. (p.40)
item-based: IID assumptions.
The adjusted cosine similarity between the items (columns) \(i\) and \(j\) is defined as follows:
\(\operatorname{Adjusted} \operatorname{Cosine}(i, j)=\frac{\sum_{u \in U_{i} \cap U_{j}} s_{u i} \cdot s_{u j}}{\sqrt{\sum_{u \in U_{i} \cap U_{j}} s_{u i}^{2}} \cdot \sqrt{\sum_{u \in U_{i} \cap U_{j}} s_{u j}^{2}}}\) (2.14, p.40)
The predicted rating \(\hat{r}_{ut}\) of user \(u\) for target item \(t\) is as follows:
\(\hat{r}_{u t}=\frac{\sum_{j \in Q_{t}(u)} \operatorname{Adjusted} \operatorname{Cosine}(j, t) \cdot r_{u j}}{\sum_{j \in Q_{t}(u)}|\operatorname{Adjusted} \operatorname{Cosine}(j, t)|}\) (2.15, p.40)
Example (p.40- p.41)
3. Efficient Implementation and Computational Complexity
Neighborhood-based methods is used to determine the best item recommendations for a target user or the best user recommendations for a target item.
We have more users than the number of items in real situations.
4. Comparing User-Based and Item-Based Methods (p.42-p.44)
Item-based Neighborhood Models: more relevant recommendations, better accuracy, more robust, concrete reasons for a recommendation, more stable with changes to the ratings. How about diversity, serendipity, novelty, etc.
User-based Neighborhood Models: Different explanations.
5. Strengths and Weaknesses of Neighborhood-Based Methods (p.44)
simple and intuitive;
easy to implement and debug;
easy to justify;
not easily available in model-based methods
impractical in large-scale settings.
6. Unified View
Main idea of clustering-based methods: replace offline nearest-neighbor computation phase with clustering.
the trade-off between accuracy and efficiency.
the rating matrix is incomplete. massive incomplete data sets.
clusters are dependent on the representatives.
purpose of clustering here: to reduce the computational complexity. It divided the original matrices into smaller matrices.
Dimensionality Reduction: improve neighborhood-based methods in quality and efficiency. (p.47)
Dimensionality Reduction: latent factor models.
singular value decomposition (SVD)-like methods. Netflix??? SVD for large-scale recommendations??? reduce the rank???. An approach for dimensionality reduction.
principal component analysis (PCA)-like methods.
Bias (p.49-p.51)
Bias: caused by filling the unspecified entries with average values along either the rows or columns.
Maximum Likelihood Estimation: covariance matrix is computed.
Aggarwal, 2016; p.50
When the matrix is sparse, the covariance estimates will be statistically unreliable. Matrix factorization methods such as singular value decomposition (SVD) are suggested.
One can approximately factorize the matrix by using truncated SVD, which can be computed as follows:
1. User-Based Nearest Neighbor Regression (p.53)
The predicted rating of target user \(u\) for item \(j\) is as follows:
\(\hat{r}_{u j}=\mu_{u}+\frac{\sum_{v \in P_{u}(j)} \operatorname{Sim}(u, v) \cdot\left(r_{v j}-\mu_{v}\right)}{\sum_{v \in P_{u}(j)}|\operatorname{Sim}(u, v)|}\) (2.22; p. 53)
The least-squares objective function is as follows:
\(\begin{aligned} \text { Minimize } J_{u} &=\sum_{j \in I_{u}}\left(r_{u j}-\hat{r}_{u j}\right)^{2} \\ &=\sum_{j \in I_{u}}\left(r_{u j}-\left[\mu_{u}+\sum_{v \in P_{u}(j)} w_{v u}^{u s e r} \cdot\left(r_{v j}-\mu_{v}\right)\right]\right)^{2} \end{aligned}\) (p.55)
The consolidated optimization problem is as follows:
\(\operatorname{Minimize} \sum_{u=1}^{m} J_{u}=\sum_{u=1}^{m} \sum_{j \in I_{u}}\left(r_{u j}-\left[\mu_{u}+\sum_{v \in P_{u}(j)} w_{v u}^{u s e r} \cdot\left(r_{v j}-\mu_{v}\right)\right]\right)^{2}\) (2.23, p.54)
The User-Based Nearest Neighbor Regression Model: (2.26; p.55)
\(\hat{r}_{u j}=b_{u}^{\text {user }}+b_{j}^{i t e m}+\frac{\sum_{v \in P_{u}(j)} w_{v u}^{\text {user }} \cdot\left(r_{v j}-b_{v}^{\text {user }}-b_{j}^{\text {item }}\right)}{\sqrt{\left|P_{u}(j)\right|}}\) (2.26; p.55)
sparsity and Bias
2. Item-Based Nearest Neighbor Regression (p.55)
The predicted rating of target user \(u\) for item \(t\) is as follows:
\(\hat{r}_{u t}=\sum_{j \in Q_{t}(u)} w_{j t}^{i t e m} \cdot r_{u j}\) (2.27, p.55)
The least-squares objective function is as follows:
\(\begin{aligned} \text { Minimize } J_{t} &=\sum_{u \in U_{t}}\left(r_{u t}-\hat{r}_{u t}\right)^{2} \\ &=\sum_{u \in U_{t}}\left(r_{u t}-\sum_{j \in Q_{t}(u)} w_{j t}^{i t e m} \cdot r_{u j}\right)^{2} \end{aligned}\) (p.56)
The consolidated optimization problem is as follows:
\(\operatorname{Minimize} \sum_{t=1}^{n} \sum_{u \in U_{t}}\left(r_{u t}-\sum_{j \in Q_{t}(u)} w_{j t}^{i t e m} \cdot r_{u j}\right)^{2}\) (2.28; p.56)
The Item-Based Nearest Neighbor Regression Model: (2.30; p.55)
\(\hat{r}_{u t}=b_{u}^{u s e r}+b_{t}^{i t e m}+\frac{\sum_{j \in Q_{t}(u)} w_{j t}^{i t e m} \cdot\left(r_{u j}-B_{u j}\right)}{\sqrt{\left|Q_{t}(u)\right|}}\) (2.30; p.56)
3. Combing Two Methods (p.57)
The combined model can be expressed as follows:
\(\hat{r}_{u j}=b_{u}^{u s e r}+b_{j}^{i t e m}+\frac{\sum_{v \in P_{u}(j)} w_{v u}^{\text {user }} \cdot\left(r_{v j}-B_{v j}\right)}{\sqrt{\left|P_{u}(j)\right|}}+\frac{\sum_{j \in Q_{t}(u)} w_{j t}^{i t e m} \cdot\left(r_{u j}-B_{u j}\right)}{\sqrt{\left|Q_{t}(u)\right|}}\) (2.31, p.57)
4. Joint Interpolation with Similarity Weighting (p.57)
The objective function:
\(\begin{aligned} \text { Minimize } & \sum_{s:(u, s) \in S} \sum_{j: j \neq s} \operatorname{AdjustedCosine}(j, s) \cdot\left(r_{u s}-\hat{r}_{u j}\right)^{2} \\ &=\sum_{s:(u, s) \in S j: j \neq s} \sum_{j d j u s t e d C o s i n e}(j, s) \cdot\left(r_{u s}-\left[\mu_{u}+\sum_{v \in P_{u}(j)} w_{v u}^{u s e r} \cdot\left(r_{v j}-\mu_{v}\right)\right]\right)^{2} \end{aligned}\) (p.57)
5. Sparse Linear Models (SLIM) (p.58)
the prediction function in SLIM is:
\(\hat{r}_{u t}=\sum^{n} w_{j t}^{i t e m} \cdot r_{u j} \forall u \in\{1 \ldots m\}, \forall t \in\{1 \ldots n\}\) (2.33; p.58)
the objective function is:
\(\begin{aligned} \text { Minimize } J_{t}^{s} &=\sum_{u=1}^{m}\left(r_{u t}-\hat{r}_{u t}\right)^{2}+\lambda \cdot \sum_{j=1}^{n}\left(w_{j t}^{i t e m}\right)^{2}+\lambda_{1} \cdot \sum_{j=1}^{n}\left|w_{j t}^{i t e m}\right| \\ &=\sum_{u=1}^{m}\left(r_{u t}-\sum_{j=1}^{n} w_{j t}^{i t e m} \cdot r_{u j}\right)^{2}+\lambda \cdot \sum_{j=1}^{n}\left(w_{j t}^{i t e m}\right)^{2}+\lambda_{1} \cdot \sum_{j=1}^{n}\left|w_{j t}^{i t e m}\right| \\ \text { subject to: } \end{aligned}\) \[ \begin{aligned} &w_{j t}^{i t e m} \geq 0 \forall j \in\{1 \ldots n\} \\ &w_{t t}^{\text {item }}=0 \end{aligned} \] (p.59)
1. User-Item Graphs (p.61-p.63)
2. User-User Graphs (p.63-p.65)
Aggarwal, 2016; p. 64
3. Item-Item Graphs (p.66-p.67)
Aggarwal, 2016; p. 66
Recall CH2 The neighbor-based methods: generalizations of \(k-\)nearest neighbor classifiers, in which the prediction approach is specific to the instance being predicted. For example, in user-based neighborhood methods, the peers of the target user are determined in order to perform the prediction.
Related neighbor-based methods: User-based, item-based, unified, clustering, dimensionality reduction, nearest neighbor regression.
Model-based methods: the training (or model-building phase) is clearly separated from the prediction phase. (p.71)
Related Model-based methods: decision trees, rule-based methods, Bayes classifier, regression models, support vector machine (SVM), and neural networks. (p.71)
Use some of them for classification tasks in educational assessments??.
Aggarwal, 2016; p. 72
Differences between classification tasks and matrix completion tasks (collaborative filtering) (p.72-p.73)
Advantages of Model-based recommender systems over neighborhood-based methods (p.73)
Decision and regression trees: classification tasks.
Decision trees: DVs are categorical. It is a hierarchical partitioning of the data space with the use of a set of hierarchical decisions.
Gini index G(S) of the node:
\(G(S)=1-\sum_{i=1}^{r} p_{i}^{2}\) (3.1, p.74)
which lies between 0 and 1, with smaller values being more indicative of greater discriminative power.
\(\operatorname{Gini}\left(S \Rightarrow\left[S_{1}, S_{2}\right]\right)=\frac{n_{1} \cdot G\left(S_{1}\right)+n_{2} \cdot G\left(S_{2}\right)}{n_{1}+n_{2}}\) (3.2, p.74)
Aggarwal, 2016; p. 75
1. Extending Decision Tree to Collaborative Filtering. (p.76-p.77)
The number of decision trees constructed is exactly equal to the number n of attributes (items). (p.76)
Dimensionality reduction methods: lower-dimensional representation of data. {style=“color:red”} (p.77)
Regression tress: DVs are numerical
Aggarwal, 2016; p. 78
The key in association rule mining is to determine sets of items that are closely correlated in teh transaction database by Support (s) and Confidence (c).
Association Rules: (p.78).
Aggarwal, 2016; p. 78
Comments:Unary transection data is used here, not related to educational assessment (omitted)
\(P\left(r_{u j}=v_{s} \mid\right.\) Observed ratings in \(\left.I_{u}\right)=\frac{P\left(r_{u j}=v_{s}\right) \cdot P\left(\text { Observed ratings in } I_{u} \mid r_{u j}=v_{s}\right)}{P\left(\text { Observed ratings in } I_{u}\right)}\)
which is also expressed as (3.5, p.82):
\[ P\left(r_{u j}=v_{s} \mid \text { Observed ratings in } I_{u}\right) \propto P\left(r_{u j}=v_{s}\right) \cdot P\left(\text { Observed ratings in } I_{u} \mid r_{u j}=v_{s}\right) \]
\[ P\left(\text { Observed ratings in } I_{u} \mid r_{u j}=v_{s}\right)=\prod_{k \in I_{u}} P\left(r_{u k} \mid r_{u j}=v_{s}\right) \] - The value of \(P\left(r_{u k} \mid r_{u j}=v_{s}\right)\) is estimated as the fraction of users that have specified the rating of \(r_{uk}\) for the \(k^{th}\) item, given they have specified the rating of their \(j^{th}\) item to \(v_s\).
\[ P\left(r_{u j}=v_{s} \mid \text { Observed ratings in } I_{u}\right) \propto P\left(r_{u j}=v_{s}\right) \cdot \prod_{k \in I_{u}} P\left(r_{u k} \mid r_{u j}=v_{s}\right) \] - *Example with Binary Ratings** (p.85)
**Comments: it is commonly used in educational assessment.
An off-the-shelf classification algorithm can be used as a black-box to predict the ratings of one of the items with the ratings of other items. (p.86)
Example: Using a Neural Network as a Black-Box. (p.88-p90)
Artificial neural networks (ANNs): basic computation unit (neurons), weights (strengths of synaptic 突触 connections); basic architecture (perceptron 感知器-a set of input nodes and an output node) (p.87-p.88). In general, an ANN can have multiple layers, and intermediate nodes can compute nonlinear functions (Figure 3.3(b)) (p.89).
Advantages of ANNs: to compute complex nonlinear functions, for complex classification tasks. Regularization can be used to reduce the impact of noise.
Recommendations: only users with valid ratings of items are used; users should not all be highly similar to one another (preselect muturally diverse users in the initial phase)
Comments: Maybe I can try to use ANNs for complex classification tasks in educational assessment???
- The signed linear function for binary output is as follows: (3.9, p.88)
\(z_{i}=\operatorname{sign}\left\{\bar{W} \cdot \overline{X_{i}}+b\right\}\)
\(\bar{W}^{t+1}=\bar{W}^{t}+\alpha\left(y_{i}-z_{i}\right) \overline{X_{i}}\)
\(\alpha > 0\) is the learning rate
\(\bar{W}^{t}\) is the value of weight vector in the \(t^{th}\) iteration
Latent factor models: is state-of-the-art (SOTA) in collaborative filtering in recommender systems (p.91).
The family of matrix factorization methods (p.127)
1. Geometric Intuition for Latent Factor Models (p. 91-p.92)
Dimension Reduction Methods: Principal component Analysis (PCA); (Mean-centered) Singular Value Decomposition (SVD)-leverage the inter-attribute correlations and redundancies to infer unspecified entries (Comments: sparse data or missing data??).
We must use a partially specified data set to recover the low-dimensional hyperplane on which the data approximately lies (p.93)
Why? p.92: to implicitly capture the underlying redundancies in the correlation structure of the data and reconstruct all the missing values in one shot. If the data does not have any correlations or redundancies, a latent factor model will simply not work.
2. Low-Rank Intuition for Latent Factor Models (p. 93-p.92)
3. Basic Matrix Factorization Principles (p. 93-p.92)
- The \(m \times N\) rating matrix \(R\) is approximately factorized into an \(m\times k\)matrix \(U\) and an \(n \times K\) matrix \(V\) is as follows:
\(R \approx U V^{T}\) (3.13, p.94)
\(r_{i j} \approx \overline{u_{i}} \cdot \overline{v_{j}}\) (3.14, p.94)
\(\begin{aligned} r_{i j} & \approx \sum_{s=1}^{k} u_{i s} \cdot v_{j s} \\ &=\sum_{s=1}^{k}(\text { Affinity of user } i \text { to concept s) } \times(\text { Affinity of item } j \text { to concept s })\end{aligned}\)
\[ \begin{aligned} r_{i j} \approx &(\text { Affinity of user } i \text { to history }) \times(\text { Affinity of item } j \text { to history }) \\ &+(\text { Affinity of user } i \text { to romance }) \times(\text { Affinity of item } j \text { to romance }) \end{aligned} \]
4. Unconstrained Matrix Factorization (a lot of details) (p.96-p.113)
Singular value decomposition (SVD): columns of \(U\) and \(V\) are orthogonal.
An optimization problem regarding \(U\) and \(V\) of its objective function is formulated as:
Minimize \(J=\frac{1}{2}\left\|R-U V^{T}\right\|^{2}\) subject to: No constraints on \(U\) and \(V\)
\[ S=\left\{(i, j): r_{i j} \text { is observed }\right\} \]
Minimize \(J=\frac{1}{2} \sum_{(i, j) \in S} e_{i j}^{2}=\frac{1}{2} \sum_{(i, j) \in S}\left(r_{i j}-\sum_{s=1}^{k} u_{i s} \cdot v_{j s}\right)^{2}\) subject to: No constraints on \(U\) and \(V\)
- The partial derivative of \(J\) to \(u_{jq}\) and \(v_{jq}\):
Aggarwal, 2016; p. 98
\(U \Leftarrow U+\alpha E V\) \(V \Leftarrow V+\alpha E^{T} U\)
Stochastic Gradient Descent is used for optimization, efficient, but sensitive, not very stable.
\[ \begin{aligned} &u_{i q} \Leftarrow u_{i q}-\alpha \cdot\left[\frac{\partial J}{\partial u_{i q}}\right]_{\text {Portion contributed by }(i, j)} \quad \forall q \in\{1 \ldots k\} \\ &v_{j q} \Leftarrow v_{j q}-\alpha \cdot\left[\frac{\partial J}{\partial v_{j q}}\right]_{\text {Portion contributed by }(i, j)} & \forall q \in\{1 \ldots k\} \end{aligned} \] - The 2-K updates *specific to the observed entry \((i, j)\in S\) are:
\[ \begin{array}{ll} u_{i q} \Leftarrow u_{i q}+\alpha \cdot e_{i j} \cdot v_{j q} & \forall q \in\{1 \ldots k\} \\ v_{j q} & \Leftarrow v_{j q}+\alpha \cdot e_{i j} \cdot u_{i q} & \forall q \in\{1 \ldots k\} \end{array} \] - K-dimensional vectorized form:
\[ \begin{aligned} &\overline{u_{i}} \Leftarrow \overline{u_{i}}+\alpha e_{i j} \overline{v_{j}} \\ &\overline{v_{j}} \Leftarrow \overline{v_{j}}+\alpha e_{i j} \overline{u_{i}} \end{aligned} \] - Regularization (p.100-p.103)
Regularization: to reduce the tendency of the model to overfit at the expense of introducing a bias.
- Alternating Least Squares (ALS) and Coordinate Descent (p.105)
Alternating Least Squares and Coordinate Descent is also used for optimization, and it is more stable comparing to Stochastic Gradient Descent, but it is not very efficient.
Incorporating User and Item Biases is a variation of unconstrained model. Using only the bias variables can often provide reasonable good rating predictions (p.109).
- Incorporating Implicit Feedback (p.109-p.113)
5. Singular Value Decomposition (p.113-p.119)
It is a form of matrix factorization in which the columns of \(U\) and \(V\) are constrainted to be mutually orthogonal.
Truncated SVD is computed as:
\(R \approx Q_{k} \Sigma_{k} P_{k}^{T}\)
\(U=Q_{k} \Sigma_{k}\) \(V=P_{k}\)
Minimize \(J=\frac{1}{2}\left\|R-U V^{T}\right\|^{2}\) subject to: Columns of \(U\) are mutually orthogonal Columns of \(V\) are mutually orthogonal
6. Non-negative Matrix Factorization (p.119-p.125)
Aggarwal, 2016; p. 122
7. Understanding the Matrix Factorization Family (p.126-p.128)
The goal of objective function/loss function is to make \(UV^{T}\) approximate the ratings matrix \(R\) as closely as possible.
The Matrix Factorization Family can be written as follows:
The collaborative filtering problem can be viewed as a generalization of the problem of classification.
Latent factor models are highly tailored to the collaborative filtering problem with different types of factorization using to predict rating.
Issues to consider: objective functions; constraints on basis matrices; trade-offs in accuracy, overfitting, and intepretability.
Latent factor models are the state-of-the-art (SOTA) in collaborative filtering. It can be combined with neighborhood methods to be more powerful.