---
title: "Instacart recommendation problem"
output:
  html_notebook: default
  html_document: default
  pdf_document: default
---

### Problem definition

The [data set](https://www.instacart.com/datasets/grocery-shopping-2017) describes customer orders over time. The goal of the experiment is to predict which products users will buy next, based on their purchase history. For convenience, the original Kaggle problem has been slightly reformulated: we consider only \<userid, productid\> pairs and discard all other associated features, such as the day of week and hour of day of the purchase. Let's get started!

The two main packages that we will be using are "recommenderlab" and "recosystem". These packages provide different algorithms for collaborative filtering (matrix completion). More info can be found here:

1. [recommenderlab package](https://cran.r-project.org/web/packages/recommenderlab/index.html)
2. [recosystem package](https://cran.r-project.org/web/packages/recosystem/index.html)

<br>

#### Load/install libraries
Let's load the required libraries and source methods first. 

**Note:**
We will use the latest (beta) version of the "recommenderlab" package. Install it directly from GitHub; do NOT use `install.packages()` (see the commented code below)
```{r}
require(dplyr)
require(data.table)

# library(devtools)
# 
# if("recommenderlab" %in% rownames(installed.packages()) == FALSE){
#   install_github("mhahsler/recommenderlab")
# }

require(recommenderlab)
require(recosystem)
require(ggplot2)
require(grid)
require(gridExtra)

source("crossvalidateALS.R")
source("partitionData.R")
source("crossvalidateUBCF.R")
source("evaluateModel2.R")
source("prepareTrainTestData.R")
source("padTestData.R")
```
<br>

#### Read data

Next, let's read the data files (stored locally in "instacart_2017_05_01" directory)
```{r}
aisles = fread('instacart_2017_05_01/aisles.csv')             # 134 aisles
departments = fread('instacart_2017_05_01/departments.csv')   # 21 departments
products = fread('instacart_2017_05_01/products.csv')         # ~50K products
order_products_prior = fread('instacart_2017_05_01/order_products__prior.csv',key=c("order_id"))    # ~32M individual products ordered
order_products_train = fread('instacart_2017_05_01/order_products__train.csv',key=c("order_id"))    # ~1.3M individual products ordered (ignore)
orders = fread('instacart_2017_05_01/orders.csv',key=c("order_id"))
```
<br>

#### Data preprocessing
* We will consider only the orders marked "prior". 
* The data set is normalized, so to obtain \<userid, productid\> pairs, we need to join the "orders_prior" and "order_products_prior" tables on the key "order_id". 
* We will then sort the resulting table ("orders_all") on the "user_id" and "product_id" keys 
* From this, we'll obtain "data_ind", which is a list of individual \<user_id, product_id\> pairs
* From "data_ind", we can accumulate identical \<user_id, product_id\> pairs to form the "data" table, which contains \<user_id, product_id, count\> triples
```{r}
orders_prior = subset(orders,eval_set=="prior")         # ~3.2M "prior" orders
orders_all = merge(orders_prior,order_products_prior, by="order_id")

orders_all = orders_all[order(orders_all$user_id,orders_all$product_id),]

data_ind = orders_all[,c("user_id", "product_id")]

setkey(data_ind,user_id,product_id)

data = aggregate(cbind(count = product_id) ~ product_id+user_id, data=data_ind, FUN = length)
data = setDT(data[,c(2,1,3)])
```
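As an aside, since "data_ind" is already a data.table, the same counting can be done with native data.table syntax, which is typically much faster than `aggregate()` at this scale. A minimal equivalent sketch (the name `data_dt` is just for illustration, and the result matches "data" up to row order):

```{r}
# Equivalent data.table aggregation: count the occurrences of each
# <user_id, product_id> pair in a single grouped pass
data_dt = data_ind[, .(count = .N), by = .(user_id, product_id)]
```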
<br>

#### Generating train/test data

The data set is too big to work with on a standalone computer, so we are going to downsample it, although ideally we could hope for the best performance using the full data. Let's see what the distribution of total product purchases across all users looks like

```{r}
product_counts = aggregate(cbind(total = count) ~ product_id, data=data[,2:3], FUN = sum)
ggplot(product_counts, aes(total)) + geom_histogram(binwidth = 1000) + xlab("number of times bought")
```

As expected, this looks highly non-uniform. Let's look at two different ranges for a better understanding: products that are bought fewer than 2000 times, and ones bought between 2000 and 20000 times

```{r}
p1 = ggplot(subset(product_counts,total<2000), aes(total)) + geom_histogram(binwidth = 10) + xlab("number of times bought")
p2 = ggplot(subset(product_counts,total>2000 & total < 20000), aes(total)) + geom_histogram(binwidth = 100) + xlab("number of times bought") + xlim(1000,20000)
grid.arrange(p1, p2, ncol = 2)
```

Similarly, let's plot the purchase frequencies of different users in the data:

```{r}
user_counts = aggregate(cbind(total = count) ~ user_id, data=data[,c(1:3)], FUN = sum)
ggplot(user_counts, aes(total)) + geom_histogram(binwidth = 10) + xlab("Total number of products bought by a user")
```
As we can see, the number of products and users decreases sharply with frequency. In other words:

* Most products were bought rarely, while only a very few products were frequently bought
* Most users bought very few products, while very few users bought a lot of products

Given this, there are different ways in which we can down-sample the data:

* **Scheme 1 (Uniform sampling):** Randomly pick *P* products and *U* users
    + *Pros:* Data distributions of popular vs. rarely-bought products and frequent vs. infrequent users are preserved
    + *Cons:* With small *P* and *U*, we might end up with mostly rarely-bought products and infrequent users, because those constitute the vast majority.
    
* **Scheme 2 (Weighted sampling):** Pick *P* products and *U* users based on their frequency of purchase
    + *Pros:* We will get more frequent users and popular products, so the data matrix will be denser, increasing the predictive power of the algorithm
    + *Cons:* Original data distributions are not preserved, so performance might be poor for rarely-bought products and infrequent users
    
* **Scheme 3 (Hybrid sampling):** Pick *P* products and *U* users based on some combination of Schemes 1 and 2 above. For example, we can divide the products and users into *N* percentile bins based on their frequency and then uniformly sample from each bin (see the sketch after the sampling code below)
    + *Pros:* We will get a good mix of all types of users and products, so overall predictive power should be better
    + *Cons:* Original data distributions are not preserved
  
**For now, we'll go with Scheme 1, in favor of preserving the original data distribution with *P*=5000 and *U*=15000** 

```{r}
data_orig = data # save the original data

users = unique(data$user_id)
user_samples = users[sample(length(unique(data$user_id)), 15000, replace = FALSE)]
data = subset(data,user_id %in% user_samples)

products = unique(data$product_id)
product_samples = products[sample(length(unique(data$product_id)), 5000, replace = FALSE)]
data = subset(data,product_id %in% product_samples)

```
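For comparison, here is a minimal sketch of the hybrid scheme (Scheme 3). It is illustrative only and not used in what follows; the helper `sampleStratified` and the bin count `nBins` are assumptions, not part of the original pipeline:

```{r}
# Illustrative sketch of Scheme 3 (hypothetical helper; not used below):
# stratify ids into frequency bins via quantiles, then sample uniformly
# from each bin so rare and power users/products are both represented.
# unique() guards against duplicated quantile breaks for skewed counts.
sampleStratified = function(ids, totals, n, nBins = 10) {
  breaks = unique(quantile(totals, probs = seq(0, 1, length.out = nBins + 1)))
  bins = cut(totals, breaks = breaks, include.lowest = TRUE)
  perBin = ceiling(n / length(levels(bins)))
  unlist(lapply(split(ids, bins), function(b) sample(b, min(length(b), perBin))),
         use.names = FALSE)
}
# e.g. user_samples = sampleStratified(user_counts$user_id, user_counts$total, 15000)
```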

* Now, partition the data into train/test (70/30), making sure that the following conditions are met:
    1. All product id's and user id's appear in both the train and test sets
    2. A \<userid, productid\> pair can appear in either the train or the test set, but not both
* The "partitionData" method partitions the data in two steps (sketched below):
    1. Randomly partition the "data" table into "train" and "test" tables as per the required split ratio
    2. Remove users/products that appear in one set but not the other
    
    **Note:** This is not the best way to split the data, and it incurs some loss. A better approach would be to partition based on the joint distribution of users and products in "data"
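A minimal sketch of these two steps, assuming a single filtering pass (the sourced `partitionData.R` is what's actually used below, and it also handles `fracData`):

```{r}
# Hypothetical sketch of the two-step partition: random row split, then
# drop users and products that don't appear in both parts. One filtering
# pass is shown; repeating it until stable would enforce condition 1 strictly.
partitionSketch = function(data, fracTrain = 0.7) {
  idx = sample(nrow(data), floor(fracTrain * nrow(data)))
  tr = data[idx]; te = data[-idx]
  keepU = intersect(tr$user_id, te$user_id)
  keepP = intersect(tr$product_id, te$product_id)
  list(train = tr[user_id %in% keepU & product_id %in% keepP],
       test  = te[user_id %in% keepU & product_id %in% keepP])
}
```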

```{r}
data_parts = partitionData(data = data, fracData = 1, fracTrain = 0.7)
train = data_parts$train
test = data_parts$test
```

Ok, now we should have a train/test split where both conditions 1 & 2 are satisfied. Let's check the split ratio again
```{r}
cat("Train % = ", nrow(train)/(nrow(train) + nrow(test))*100, "and test % = ", nrow(test)/(nrow(train) + nrow(test))*100, "\n")
cat("nrows train = ", nrow(train), "\n")
cat("nrows test = ", nrow(test), "\n")
cat("Number of unique user_id's = ", length(unique(train$user_id)), "\n")
cat("Number of unique product_id's = ", length(unique(train$product_id)), "\n")
```

Also, let's quickly compare the sparsity of the train/test data with that of the original data
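Here, sparsity is simply the percentage of empty cells in the user-product count matrix, which is what the expressions below compute:

$$
\text{sparsity} = \left(1 - \frac{\#\,\langle \text{user\_id},\ \text{product\_id} \rangle\ \text{pairs}}{|\text{users}| \times |\text{products}|}\right) \times 100\,\%
$$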
```{r}
cat("Sparsity in original data = ", 100 - nrow(data_orig)/(as.double(length(unique(data_orig$product_id)))*as.double(length(unique(data_orig$user_id))))*100, "%\n")
cat("Sparsity in train data = ", 100 - nrow(train)/(as.double(length(unique(train$product_id)))*as.double(length(unique(train$user_id))))*100, "%\n")
cat("Sparsity in test data = ", 100 - nrow(test)/(as.double(length(unique(test$product_id)))*as.double(length(unique(test$user_id))))*100, "%\n")
```
So it is roughly the same, which is what we expect with the uniform sampling approach

<br>

#### Exploring the Weighted Alternating Least Squares approach (via "recommenderlab" package)
The "recommenderlab" package implements the ALS (alternating least squares) optimization approach for implicit feedback described in [Hu et al.'s paper](http://yifanhu.net/PUB/cf.pdf)

**Note 1:** The package *has* a routine to tune different parameters: the regularization parameter, lambda, and the number of latent factors, *d*. But that tuning optimizes MSE (mean squared error), MAE (mean absolute error) or RMSE (root mean squared error); since we are interested in minimizing the MPR (mean percentile ranking), we'll use our own method for tuning these parameters
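As a refresher, the percentile rank (PR) of a held-out purchase is its position in the user's predicted ranking over candidate products, scaled to $[0, 100]$ (0% = most recommended, 50% = random guessing); MPR averages this over held-out items. A minimal sketch for one user, assuming `scores` holds predicted scores for the candidate list and `heldOut` indexes the test items (the sourced `evaluateModel.R` implements the full computation):

```{r}
# Minimal sketch of percentile ranking for one user (illustrative;
# evaluateModel.R implements the actual MPR computation): rank the
# candidates by predicted score and rescale the ranks to [0, 100].
percentileRank = function(scores, heldOut) {
  r = rank(-scores, ties.method = "average")  # 1 = highest predicted score
  (r[heldOut] - 1) / (length(scores) - 1) * 100
}
percentileRank(c(0.9, 0.1, 0.5, 0.3, 0.7), heldOut = c(1, 4))  # 0 and 75
```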

**Note 2:** Ideally, for cross-validation, we should partition the training data into *f* folds and use a leave-one-fold-out technique to tune the parameters. However, to tune the parameters faster, we will instead resample the "train" data multiple times to generate training/validation sets and simply average the MPR over the different runs (folds) for each set of parameters

```{r}
source("evaluateModel.R")
source("crossvalidateALS.R")
lambda = c(0.01,1,10)
nFactors = c(10,100,200)
alpha = c(1, 10, 100)
ranks_ALS = crossvalidateALS(data = train, lambda = lambda, nFactors = nFactors, alpha = alpha, fracData = 0.2, cvFolds = 4, nIter = 10, fracTrain = 0.7)
```


```{r}
ranks_ALS
```

```{r}
lambda_best = lambda[which(ranks_ALS == min(ranks_ALS), arr.ind = TRUE)[1]]
nFactors_best = nFactors[which(ranks_ALS == min(ranks_ALS), arr.ind = TRUE)[2]]
alpha_best = alpha[which(ranks_ALS == min(ranks_ALS), arr.ind = TRUE)[3]]  # third dimension of ranks_ALS indexes the alpha grid

cat("best lambda value =", lambda_best, "and best d value =", nFactors_best, "and best alpha value =", alpha_best, "\n")
```

Now that we have the optimal parameters, let's retrain on the full training data using the optimal parameter values computed above and compute/evaluate the predictions. Note that, since we downsampled the data, padding with 100 random samples instead of 1000 should be sufficient

```{r}
R = as(train, "realRatingMatrix")
r = Recommender(R, method = "ALS_implicit", parameter = list(n_factors=nFactors_best, lambda=lambda_best, alpha=alpha_best, n_iterations=20, seed = NULL, verbose = TRUE)) 
reco = predict(r, R, type = "ratings")
rank_final_recolab = evaluateModel(reco = reco, test = test, pad = 100)
```
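For reference, the padding step inside the evaluation can be sketched as follows; `padSketch` is a hypothetical stand-in for the sourced `padTestData.R`:

```{r}
# Hypothetical sketch of test-set padding (padTestData.R is what's
# actually used): for each user, append `pad` random products absent
# from their train/test history, with count 0, so the held-out
# purchases are ranked against random negatives. Illustrative only;
# the per-user train scan is not optimized.
padSketch = function(test, train, pad = 100) {
  allProducts = unique(c(train$product_id, test$product_id))
  padded = lapply(split(test, test$user_id), function(u) {
    seen = c(u$product_id, train$product_id[train$user_id == u$user_id[1]])
    candidates = setdiff(allProducts, seen)
    negs = sample(candidates, min(pad, length(candidates)))
    rbind(u, data.table(user_id = u$user_id[1], product_id = negs, count = 0))
  })
  rbindlist(padded)
}
```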

```{r}
cat("Mean MPR across all users = ", mean(rank_final_recolab),"\n")
cat("Median MPR across all users = ", median(rank_final_recolab),"\n")
```
Let's also look at the full histogram of individual PRs
```{r}
ggplot(as.data.frame(rank_final_recolab), aes(rank_final_recolab, fill= rank_final_recolab > 50)) + geom_histogram(binwidth = 1) + xlab("Individual Percentile Ranking") + ylab("Number of users")
```
```{r}
cat("Number of users with better than random predictions =", length(subset(rank_final_recolab,rank_final_recolab<50)), "\n")
cat("Number of users with worse than random predictions =", length(subset(rank_final_recolab,rank_final_recolab>=50)), "\n")
```
Let's also look at a scatter plot of individual user PRs against their purchase frequency
```{r}
user_counts_test = aggregate(cbind(total = count) ~ user_id, data=test[,c(1:3)], FUN = sum)
user_counts_test = user_counts_test[order(user_counts_test$user_id),]
pred_ALS = getData.frame(reco, decode = TRUE, ratings = TRUE)
users_ALS = data.frame(unique(pred_ALS$user),rank_final_recolab)
user_counts_test$ranks_ALS = users_ALS$rank_final_recolab
ggplot(user_counts_test) + geom_point(aes(x=total, y=ranks_ALS, color = ranks_ALS > 50)) + xlab("Total purchases") + ylab("Individual Percentile Ranking")
```
Looks like the learning is quite uniform across all types of users (frequent vs. infrequent buyers). Let's also look at the percentage of users who bought < 25 products
```{r}
cat('% of users who bought < 25 products (overall) =', nrow(subset(user_counts_test,total<25))/nrow(user_counts_test)*100, "%\n")
cat('% of users who bought < 25 products (with PR > 50%) =', nrow(subset(user_counts_test,total<25 & ranks_ALS > 50))/nrow(subset(user_counts_test, ranks_ALS > 50))*100, "%\n")
```
This confirms that the learning is quite uniform across different types of users

<br>

#### Exploring matrix factorization (via "recosystem" package)

Let's explore the "recosystem" library. This package has a much more efficient implementation based on SGD and also supports parallel/cluster computation; however, it implements the more traditional matrix factorization, which **does not** modify the cost function for implicit feedback data

Nonetheless, I thought it would be interesting to compare it with the ALS method. In addition to the number of factors, it provides hooks to tune the learning rate (step size in SGD) and separate regularization coefficients for the user and product factors, for both L1 and L2 penalties.
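Roughly, the loss recosystem/LIBMF minimizes with L2 regularization, over the set $\Omega$ of observed \<user, product, count\> triples only, is

$$
\min_{P,\,Q} \; \sum_{(u,i)\,\in\,\Omega} \Bigl( \bigl(r_{ui} - p_u^\top q_i\bigr)^2 + \lambda_p \|p_u\|^2 + \lambda_q \|q_i\|^2 \Bigr)
$$

where $\lambda_p$ and $\lambda_q$ correspond to `costp_l2` and `costq_l2` below; unlike the weighted ALS objective above, there is no confidence weighting of the counts.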

**Note:** As before, ideally we should optimize for MPR when tuning the parameters. But we'll ignore that for now and optimize the default squared (L2) loss

```{r}
r2 = Reco()

opts = r2$tune(data_memory(train$user_id,train$product_id,train$count), opts = list(dim = c(10,100,200), lrate = c(0.1, 1, 10), costp_l2 = c(0.1, 1, 10), costq_l2 = c(0.1, 1, 10), nthread = 4, niter = 20))
```

```{r}
opts$min
```

Now let's learn the model on the full train data using the best parameters computed above

```{r}
r2$train(data_memory(train$user_id,train$product_id,train$count), opts = c(opts$min, nthread = 4, niter = 100))
```
Now let's pad the test data with random samples for each user, predict the ratings, and evaluate the model.
```{r}
test_padded = padTestData(test,train,1000)
reco2 = r2$predict(data_memory(test_padded$user_id, test_padded$product_id), out_memory())
ranks_MF = evaluateModel2(test_padded,reco2)
```

```{r}
cat("Mean MPR across all users (MF) = ", mean(ranks_MF),"\n")
cat("Median MPR across all users (MF) = ", median(ranks_MF),"\n")
```

Looks like the mean and median are not too far apart. Let's look at the histogram of all users' PRs
```{r}
ggplot(as.data.frame(ranks_MF), aes(ranks_MF, fill= ranks_MF > 50)) + geom_histogram(binwidth = 1) + xlab("Individual Percentile Ranking") + ylab("Number of users")
```


```{r}
cat("Number of users with better than random predictions =", length(subset(ranks_MF,ranks_MF<50)), "\n")
cat("Number of users with worse than random predictions =", length(subset(ranks_MF,ranks_MF>=50)), "\n")
```
Let's also look at a scatter plot of individual user PRs against their purchase frequency
```{r}
pred_MF = data.frame(test_padded[,1],test_padded[,2],reco2)
users_MF = data.frame(unique(pred_MF$user_id),ranks_MF)
user_counts_test$ranks_MF = users_MF$ranks_MF
ggplot(user_counts_test) + geom_point(aes(x=total, y=ranks_MF, color = ranks_MF > 50)) + xlab("Total purchases") + ylab("Individual Percentile Ranking")
```
**Observation:** It is evident that MF performs poorly mainly for users who are infrequent buyers, which was not the case for ALS. This is expected: ALS is tuned for implicit feedback data, where a low purchase count does not imply that the user *"dislikes"* a product, whereas plain MF effectively treats low counts as dislike and therefore makes wrong predictions about what such users might buy in the future. **In particular, this means the MF algorithm would NOT recommend products similar to the ones infrequent users already bought** -- which is not correct

Almost all of the cases where the performance is worse than 50% are for users who bought fewer than 25 products (upper-left corner). As discussed earlier, a better down-sampling scheme that yields a denser matrix might give better performance, since we could leverage frequent (power) users more

```{r}
cat('% of users who bought < 25 products (overall) =', nrow(subset(user_counts_test,total<25))/nrow(user_counts_test)*100, "%\n")
cat('% of users who bought < 25 products (with PR > 50%) =', nrow(subset(user_counts_test,total<25 & ranks_MF > 50))/nrow(subset(user_counts_test, ranks_MF > 50))*100, "%\n")
```


### Summary

#### Data:
+ Number of users: 11,263
+ Number of products: 3,138
+ Train data points: 79,307 (68%) 
+ Test data points: 37,328 (32%)

#### Output:
+ Mean Percentile Rank using the Weighted Alternating Least Squares method (Hu et al.): **13.49%** (Median: 7.31%)
+ Mean Percentile Rank using Matrix Factorization: **27.80%** (Median: 24.71%)


### Discussion
The experiments performed here are exploratory in nature and do not reflect the true potential of these algorithms. There is a lot of room for improvement:

1. **Data set:** We used only a small subset of the data (about 5.5% of users and 6.5% of products). The goal of my experiments was to make them runnable on a single computer. Deploying these algorithms on a Spark cluster over the entire data set could give much better results.
2. **Sampling:** As described above, we could use a hybrid sampling approach so that, even with a small subset of the data, we get a better representation of different types of users and products based on their frequency. This might improve the results further.
3. **Parameter tuning:** Better/more parameter tuning could have resulted in better MPRs:
    + The number of combinations grows exponentially with each new hyperparameter value, which drives up the computation time for cross-validation. We could tune more efficiently using coarse-to-fine methods.
    + For the Alternating Least Squares method, we didn't do traditional cross-validation, but a variant of it using smaller data sets. Traditional cross-validation could have yielded better hyperparameter estimates because it works on larger slices of train/validation data.
    + For the Matrix Factorization method, we used the built-in parameter tuning, which doesn't optimize the MPR. Writing our own method for that might have given better hyperparameter estimates. We could also try minimizing the L1 loss instead of L2 for both regularization coefficients
4. **Algorithms:** We could also look at different formulations of the algorithms. An important variation involves including bias terms for every user and product (see the formula below). These would encode certain users' tendency to buy more, or a more diverse set of, products than others, and certain products being generally more popular/important than others. The ALS algorithm can be easily modified to introduce these biases ([like suggested here](http://activisiongamescience.github.io/2016/01/11/Implicit-Recommender-Systems-Biased-Matrix-Factorization/)). Likewise, Johnson's approach using [Logistic Matrix Factorization](https://stanford.edu/~rezab/nips2014workshop/submits/logmat.pdf) can also be used. That said, I believe the biggest gains will likely come from using more data.
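For reference, the biased formulation replaces the plain inner-product prediction $\hat{r}_{ui} = x_u^\top y_i$ with

$$
\hat{r}_{ui} = \mu + b_u + b_i + x_u^\top y_i
$$

where $\mu$ is a global offset and $b_u$, $b_i$ are learned user and product biases; these can be folded into the ALS updates by augmenting the factor vectors with constant components.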