Intro

We will build collaborative filtering-based recommender systems for movie ratings using a regression-based technique to account for user and item biases, as well as matrix factorization methods to find latent themes.


Load libraries

import pandas as pd
import numpy as np
import funk_svd as fsvd
np.set_printoptions(suppress=True)

import os
os.environ['QT_QPA_PLATFORM_PLUGIN_PATH'] = 'C:/Anaconda3/Library/plugins/platforms'


Get data

We will use the MovieLens dataset of 100,000 movie ratings, which we procure from Github and pivot to user-item matrix form. Each row represents a user, each column a movie, and the values are the ratings (nan’s are listed for any movie not rated.)

col_names = ["user_id", "item_id", "rating", "timestamp"]
ratings_df = pd.read_csv('https://raw.githubusercontent.com/mehtablocker/cuny_612/master/data_files/MovieLens/u.data', sep='\t', names=col_names, encoding='latin-1')
ui_mat = ratings_df.pivot_table(index="user_id", columns="item_id", values="rating").to_numpy()
ui_mat[0:5, 0:5]
## array([[ 5.,  3.,  4.,  3.,  3.],
##        [ 4., nan, nan, nan, nan],
##        [nan, nan, nan, nan, nan],
##        [nan, nan, nan, nan, nan],
##        [ 4.,  3., nan, nan, nan]])


Split data

We start by breaking our data into training and test sets.


### Create test and train indexes
pct_test = 0.2
n_test = int(round(pct_test * np.count_nonzero(~np.isnan(ui_mat)), 0))
non_na_ind = np.argwhere(~np.isnan(ui_mat))
seq = np.arange(len(non_na_ind[:, 0]))
test_ind = np.random.choice(seq, n_test, replace = False)
train_ind = np.delete(seq, test_ind)

### Break matrix into test and train
train_mat = np.copy(ui_mat)
train_mat[non_na_ind[test_ind, 0], non_na_ind[test_ind, 1]] = np.nan
test_mat = np.copy(ui_mat)
test_mat[non_na_ind[train_ind, 0], non_na_ind[train_ind, 1]] = np.nan
train_mat[0:4, 0:4]
## array([[ 5., nan,  4.,  3.],
##        [ 4., nan, nan, nan],
##        [nan, nan, nan, nan],
##        [nan, nan, nan, nan]])
test_mat[0:4, 0:4]
## array([[nan,  3., nan, nan],
##        [nan, nan, nan, nan],
##        [nan, nan, nan, nan],
##        [nan, nan, nan, nan]])


Normalize data

Subtracting a “baseline” from user-item ratings is a common first step in the building of recommender systems. This baseline is usually comprised of a global average (i.e., “global bias”), an average for each user (i.e., “user bias”), and an average for each item (i.e., “item bias”).


### Calculate baseline and normalize matrix
global_mean = np.nanmean(train_mat)
u_bias_vec = np.nanmean(train_mat, axis = 1) - global_mean
u_bias_mat = np.tile(u_bias_vec, (len(train_mat[0,:]), 1)).transpose()
i_bias_vec = np.nanmean(train_mat, axis = 0) - global_mean
i_bias_mat = np.tile(i_bias_vec, (len(train_mat[:, 0]), 1))
baseline_mat = global_mean + u_bias_mat + i_bias_mat
train_mat_norm_simp = train_mat - baseline_mat
train_mat_norm_simp = np.nan_to_num(train_mat_norm_simp)


Alternate baseline

Using simple averages for the baseline calculation is not particularly precise. It might be adequate if the user-item matrix were more densely populated, but with such a sparse matrix, the sample size for an individual user or item is often not large enough to sufficiently account for randomness.

A likely more accurate procedure for obtaining biases is to run a regression in which all users and items are the predictor variables and the ratings are the response variable. Then the slope coefficients for each user and item are the “user bias” and “item bias”, and the intercept coefficient is the “global bias”.

Implementation

One way of achieving such a regression is to “one-hot encode” every user, item, and rating. For example, let’s say Joe gave Toy Story a 5 rating. Rather than having a user-item matrix with a 5 in the cell that corresponds to Joe and Toy Story, we instead have a matrix where the columns are every user and every movie, and an additional column is the rating. In that case our first row would have a zero in every cell except a 1 for Joe, a 1 for Toy Story, and a 5 for Rating. And the same idea for the next row (rating.)

For example:

library(knitr)
library(tibble)

A_b <- tibble(Bob=c(0, 0), Joe=c(1,0), Nicole=c(0, 1), `all other users...`=c("0...", "0..."), `Toy Story`=c(1,0), `Forrest Gump`=c(0,1), `all other movies...`=c("0...", "0..."), Rating=c(5, 4))

kable(A_b)
Bob Joe Nicole all other users… Toy Story Forrest Gump all other movies… Rating
0 1 0 0… 1 0 0… 5
0 0 1 0… 0 1 0… 4

We could then run a regression in which the Rating column is the response variable and all of the other columns are the predictors. The result of the regression would give a coefficient for every user and item, as well as an intercept.

Considerations

Computing power is a concern when running a regression of this size. Even with the moderately sized MovieLens 100k ratings, we’ll be computing coefficients for over 25,000 predictor variables. This is computationally expensive.

To combat this expense we can use a linear algebra trick. Instead of running a regression with our one-hot encode matrix A and our ratings vector b to find our coefficient vector x, we can use the matrix AT A and the vector AT b, and then directly solve for x. This is a much cheaper computational task. To make this work we will need to employ a couple other tricks (to account for the intercept as well as the fact that the AT A matrix is singular), but overall it is a fairly easy procedure:


### Create binary matrix out of appropriate quadrants
ur_mat = np.copy(train_mat)
non_na_ind = np.argwhere(~np.isnan(ur_mat))
ur_mat[non_na_ind[:,0], non_na_ind[:,1]] = 1
ur_mat = np.nan_to_num(ur_mat)
ul_mat = np.zeros((len(train_mat[:,0]), len(train_mat[:,0])))
np.fill_diagonal(ul_mat, np.nansum(ur_mat, axis=1))
lr_mat = np.zeros((len(train_mat[0,:]), len(train_mat[0,:])))
np.fill_diagonal(lr_mat, np.nansum(ur_mat, axis=0))
ll_mat = ur_mat.transpose()
train_mat_bin = np.vstack((np.hstack((ul_mat, ur_mat)), np.hstack((ll_mat, lr_mat))))

### Create means vectors
u_means_vec = np.nanmean(train_mat, axis=1)
i_means_vec = np.nanmean(train_mat, axis=0)
ui_means_vec = np.concatenate([u_means_vec, i_means_vec])
mean_u_means = np.nanmean(u_means_vec)
mean_i_means = np.nanmean(i_means_vec)
u_sums_vec = np.nansum(train_mat, axis=1)
i_sums_vec = np.nansum(train_mat, axis=0)
ui_sums_vec = np.concatenate([u_sums_vec, i_sums_vec])

### Set up solvable matrix version of Bayes / ridge regression by adding to the diagonal of A^T A
K = 3
ATA = np.copy(train_mat_bin)
np.fill_diagonal(ATA, np.diagonal(ATA) + K)
ATb = np.copy(ui_sums_vec)
u_etas = np.repeat(mean_u_means, len(u_means_vec))
i_etas = np.repeat(mean_i_means, len(i_means_vec))
etas = np.concatenate([u_etas, i_etas])
ATb = ATb + etas * K

### Add intercept by adding a row and column to ATA which is total number of ratings (including the added Bayes ratings)
###### concatenated on both sides with number of user (item) ratings (in this case it is diagonal of ATA matrix bc A is
###### all 1s and 0s, so don't need to worry about squared terms.  i.e., 1^2 = 1)
### And add a value to ATb which is the total sum of ratings (including the added Bayes ratings)
total_n_ratings = len(non_na_ind) + len(np.diagonal(ATA)) * K
new_top_row = np.hstack(( total_n_ratings, np.diagonal(ATA) ))
new_side_column = np.diagonal(ATA).reshape(-1,1)
ATA = np.vstack(( new_top_row, np.hstack(( new_side_column, ATA )) ))
total_sum_ratings = np.nansum(u_sums_vec) + np.nansum(etas * K)
ATb = np.concatenate([np.array([total_sum_ratings]), ATb])

### Solve for x
x = np.linalg.solve(ATA, ATb)

ATA[0:5, 0:5]
## array([[87875.,   229.,    50.,    44.,    20.],
##        [  229.,   229.,     0.,     0.,     0.],
##        [   50.,     0.,    50.,     0.,     0.],
##        [   44.,     0.,     0.,    44.,     0.],
##        [   20.,     0.,     0.,     0.,    20.]])
ATb[0:5]
## array([308304.22731246,    833.77606307,    184.77606307,    128.77606307,
##            80.77606307])
x[0:5]
## array([ 3.36257631,  0.27094039,  0.28055452, -0.13172339,  0.77297076])


Model evaluation

Armed with our intercept and slope coefficient vector x, i.e., the new bias values, we can compare the RMSE of the resulting predictions (evaluated on the held-out Test data) to the RMSE of the predictions obtained using standard biases (i.e., simple averages).


### Populate a prediction matrix using the regression coefficients
train_mat_pred = np.full(train_mat.shape, np.nan)
for i in range(len(train_mat_pred[:,0])):
    train_mat_pred[i, ] = x[0] + x[i+1] + x[(len(u_means_vec)+1):len(x)]
below_1_inds = np.argwhere(train_mat_pred<1)
train_mat_pred[below_1_inds[:,0], below_1_inds[:,1]] = 1   #round any value below 1 or above 5
above_5_inds = np.argwhere(train_mat_pred>5)
train_mat_pred[above_5_inds[:,0], above_5_inds[:,1]] = 5
train_mat_norm = train_mat - train_mat_pred
train_mat_norm = np.nan_to_num(train_mat_norm)

### Prediction matrix using standard baseline
train_mat_pred_simp = train_mat_norm_simp + baseline_mat
below_1_inds = np.argwhere(train_mat_pred_simp<1)
train_mat_pred_simp[below_1_inds[:,0], below_1_inds[:,1]] = 1
above_5_inds = np.argwhere(train_mat_pred_simp>5)
train_mat_pred_simp[above_5_inds[:,0], above_5_inds[:,1]] = 5
### Calculate RMSE on Test data
def rmse(predicted, observed, rmna=False):
    if rmna==True:
        return np.nanmean((predicted - observed)**2)**0.5
    else:
        return np.mean((predicted - observed)**2)**0.5

### regression based
rmse_pred = rmse(train_mat_pred, test_mat, rmna=True)
print("RMSE for regression baseline: ", rmse_pred)


### simple baseline
## RMSE for regression baseline:  0.9405088753458019
rmse_pred_simp = rmse(train_mat_pred_simp, test_mat, rmna=True)
print("RMSE for standard baseline: ", rmse_pred_simp)
## RMSE for standard baseline:  0.9672017728146373

As we can see, the regression baseline yields more accurate predictions than the standard baseline!


Matrix factorization

We can elaborate on both of our models by using Singular Value Decomposition on the residuals (i.e., observed rating minus baseline predicted rating). This effectively finds latent factors or “themes” in the values over and above the baseline. In other words, if Bruce rated both Inception and Blade Runner a half point higher than we predicted according to baseline, maybe it was not random but part of a pattern that Bruce likes science-fiction type movies. Matrix factorization techniques seek out these patterns by finding hidden correlations and accordingly reducing dimensionality. (Though, the new themes/factors are not directly interpretable like the example.)

We run SVD on both models and compare the RMSE:


### Perform SVD on regression baseline
u_regr, s_regr, vt_regr = np.linalg.svd(train_mat_norm)
k = 10
u_comp = u_regr[:, 0:k]
s_comp = s_regr[0:k]
vt_comp = vt_regr[0:k, :]
train_mat_norm_comp = u_comp.dot(np.diag(s_comp)).dot(vt_comp)
train_mat_pred_svd = train_mat_norm_comp + train_mat_pred
below_1_inds = np.argwhere(train_mat_pred_svd<1)
train_mat_pred_svd[below_1_inds[:,0], below_1_inds[:,1]] = 1
above_5_inds = np.argwhere(train_mat_pred_svd>5)
train_mat_pred_svd[above_5_inds[:,0], above_5_inds[:,1]] = 5

### RMSE on test data for regression baseline
rmse_pred_svd = rmse(train_mat_pred_svd, test_mat, rmna=True)
print("RMSE for regression baseline: ", rmse_pred_svd)


### Perform SVD on standard baseline
## RMSE for regression baseline:  0.9232544284627855
u_simp, s_simp, vt_simp = np.linalg.svd(train_mat_norm_simp)
u_comp = u_simp[:, 0:k]
s_comp = s_simp[0:k]
vt_comp = vt_simp[0:k, :]
train_mat_norm_simp_comp = u_comp.dot(np.diag(s_comp)).dot(vt_comp)
train_mat_pred_simp_svd = train_mat_norm_simp_comp + train_mat_pred_simp
below_1_inds = np.argwhere(train_mat_pred_simp_svd<1)
train_mat_pred_simp_svd[below_1_inds[:,0], below_1_inds[:,1]] = 1
above_5_inds = np.argwhere(train_mat_pred_simp_svd>5)
train_mat_pred_simp_svd[above_5_inds[:,0], above_5_inds[:,1]] = 5

### RMSE on test data for simple baseline
rmse_pred_simp_svd = rmse(train_mat_pred_simp_svd, test_mat, rmna=True)
print("RMSE for standard baseline: ", rmse_pred_simp_svd)
## RMSE for standard baseline:  0.9423742401733005

Both RMSEs are lower than before, with the regression baseline version outperforming the standard baseline version.

SGD instead of SVD

Another commonly used matrix factorization method is Stochastic Gradient Descent. Known as “Funk SVD” in this context (after Simon Funk from the Netflix project), it is actually an iterative algorithm designed to closely approximate SVD. It is generally considered to handle missing values a bit better than regular SVD (for which we had to impute zeros for NAs in order to run).

We can try SGD instead of SVD (on both models) and compare the RMSE:


### Perform Funk SVD on regression baseline
train_df_norm = pd.DataFrame(data = train_mat_norm).stack().reset_index().rename(columns={'level_0':'u_id', 'level_1':'i_id', 0:'rating'})
train_df_norm.rating = train_df_norm.rating.replace(0, np.nan)   #go back to keeping nan values
f_svd_train = fsvd.SVD(learning_rate=0.001, regularization=0.015, n_epochs=200, n_factors=k)
f_svd_train.fit(train_df_norm)
train_mat_norm_comp = f_svd_train.pu.dot(f_svd_train.qi.transpose())
train_mat_pred_sgd = train_mat_norm_comp + train_mat_pred
below_1_inds = np.argwhere(train_mat_pred_sgd<1)
train_mat_pred_sgd[below_1_inds[:,0], below_1_inds[:,1]] = 1
above_5_inds = np.argwhere(train_mat_pred_sgd>5)
train_mat_pred_sgd[above_5_inds[:,0], above_5_inds[:,1]] = 5

### Perform Funk SVD on standard baseline
train_df_norm_simp = pd.DataFrame(data = train_mat_norm_simp).stack().reset_index().rename(columns={'level_0':'u_id', 'level_1':'i_id', 0:'rating'})
train_df_norm_simp.rating = train_df_norm_simp.rating.replace(0, np.nan)   #go back to keeping nan values
f_svd_train_simp = fsvd.SVD(learning_rate=0.001, regularization=0.015, n_epochs=200, n_factors=k)
f_svd_train_simp.fit(train_df_norm_simp)
train_mat_norm_simp_comp = f_svd_train_simp.pu.dot(f_svd_train_simp.qi.transpose())
train_mat_pred_simp_sgd = train_mat_norm_simp_comp + train_mat_pred_simp
below_1_inds = np.argwhere(train_mat_pred_simp_sgd<1)
train_mat_pred_simp_sgd[below_1_inds[:,0], below_1_inds[:,1]] = 1
above_5_inds = np.argwhere(train_mat_pred_simp_sgd>5)
train_mat_pred_simp_sgd[above_5_inds[:,0], above_5_inds[:,1]] = 5
### RMSE on test data for regression baseline
rmse_test_sgd = rmse(train_mat_pred_sgd, test_mat, rmna=True)
print("RMSE for regression baseline: ", rmse_test_sgd)


### RMSE on test data for standard baseline
## RMSE for regression baseline:  0.9271434820439688
rmse_test_simp_sgd = rmse(train_mat_pred_simp_sgd, test_mat, rmna=True)
print("RMSE for standard baseline: ", rmse_test_simp_sgd)
## RMSE for standard baseline:  0.9498245269124519

Once again, both RMSEs are lower than they were without using matrix factorization, with the regression baseline version outperforming the standard baseline version.


Summary

Using a dataset of movie ratings from MovieLens, we built several different collaborative filtering recommender models.

First we normalized our data and came up with baselines for users and items. The standard way of doing this is to use simple averages, but instead we ran a regression. Due to computational expense we used a few linear algebra tricks to speed up the regression. Next we explored dimensionality reduction using two different forms of matrix factorization.

We evaluated our various models by training them on a subset of the original data and then calculating the RMSE of predictions on the unseen test data. In doing so, we saw more accurate predictions with regression baselines compared to standard baselines, and with matrix factorization compared to no matrix factorization.

Future work could explore the distributions of the biases. Linear regression did a good job of improving point estimates for users and items, but additionally examining the shapes of user and item distributions might prove beneficial.