Project 4

library("recommenderlab")
library("ggplot2")
data(MovieLense)
ratings_movies <- MovieLense[rowCounts(MovieLense)>100,colCounts(MovieLense)>50]
ratings_movies
## 358 x 591 rating matrix of class 'realRatingMatrix' with 60084 ratings.
# we keep users who have rated more than 100 movies and movies that have been rated more than 50 times. This reduces the number of "NA" entries.
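
As a quick check (my own addition, not from the textbook), we can compare the density of the original and the filtered matrices; a higher density means fewer "NA" entries relative to the size of the matrix.

# density of the full matrix vs. the filtered subset
length(getRatings(MovieLense)) / (nrow(MovieLense) * ncol(MovieLense))
length(getRatings(ratings_movies)) / (nrow(ratings_movies) * ncol(ratings_movies))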

Splitting the data into training and testing sets. Following the example in the textbook, I set the training proportion (percentage_training) to 70%.

percentage_training <- 0.7

For the test set we keep a fixed number of rated items per user, so we need to find the minimum number of movies any user has rated and set the parameter below that value.

min(rowCounts(ratings_movies))
## [1] 59
items_to_keep <- 50
# ratings of 3 and above are good ratings; ratings below 3 are bad.
rating_threshold <- 3
# times to run the evaluation
n_eval <- 1
evaluation_set <- evaluationScheme(data=ratings_movies,method = "split", train = percentage_training, given = items_to_keep, goodRating = rating_threshold, k = n_eval)
evaluation_set
## Evaluation scheme with 50 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.700
## Good ratings: >=3.000000
## Data set: 358 x 591 rating matrix of class 'realRatingMatrix' with 60084 ratings.

Constructing training set.

getData(evaluation_set,"train")
## 250 x 591 rating matrix of class 'realRatingMatrix' with 41920 ratings.
nrow(getData(evaluation_set,"train")) / nrow(ratings_movies)
## [1] 0.698324
# it is about 70% for our training set.
nrow(getData(evaluation_set,"known")) / nrow(ratings_movies)
## [1] 0.301676
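
A small sanity check (my own addition, not in the textbook): in the "known" set every test user keeps exactly items_to_keep (50) ratings, and the rest of that user's ratings go into the "unknown" set used later for measuring accuracy.

# every test user should have exactly 50 ratings in the "known" part
table(rowCounts(getData(evaluation_set, "known")))
# the remaining ratings of each test user form the "unknown" part
summary(rowCounts(getData(evaluation_set, "unknown")))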

Next we evaluate the rating predictions in order to recommend movies to new users. To choose k for k-fold cross-validation, I looked online; most sources suggest a value between 5 and 10, so in my case I use k = 10 for 10-fold cross-validation.

# k-fold is the most accurate approach. 
n_fold <- 10
evaluation_ratingSet <- evaluationScheme(data = ratings_movies, method = "cross-validation", k = n_fold, given = items_to_keep,goodRating = rating_threshold)
evaluation_ratingSet
## Evaluation scheme with 50 items given
## Method: 'cross-validation' with 10 run(s).
## Good ratings: >=3.000000
## Data set: 358 x 591 rating matrix of class 'realRatingMatrix' with 60084 ratings.
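
Note that evaluation_ratingSet is defined here, but the remaining steps use the split scheme (evaluation_set). As a rough sketch of my own (not run above) showing how the 10-fold scheme could be used instead:

# evaluate IBCF rating predictions averaged over the 10 folds
cv_results <- evaluate(x = evaluation_ratingSet, method = "IBCF", type = "ratings")
avg(cv_results)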
# Item-based collaborative filtering (IBCF) with the default parameters (NULL): IBCF recommends new items and predicts their ratings.

model_to_evaluate <- "IBCF"
model_parameters <- NULL
eval_recommender <- Recommender(data=getData(evaluation_set,"train"),method = model_to_evaluate, parameter= model_parameters)
items_to_recommend <- 5

eval_prediction <- predict(object = eval_recommender, newdata = getData(evaluation_set,"known"),n=items_to_recommend,type="ratings")

qplot(rowCounts(eval_prediction)) +
  geom_histogram(binwidth = 30) +
  ggtitle("Distribution of movies per user")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The number of predicted movies per user is between about 300 and 500, with a peak around 400.

We will measure the accuracy and compute RMSE, MSE, and MAE. According to the textbook: RMSE is the standard deviation of the difference between the real and predicted ratings; MSE is the mean of the squared differences between the real and predicted ratings (it contains the same information as RMSE); MAE is the mean of the absolute differences between the real and predicted ratings.
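
To make these definitions concrete, here is a minimal hand computation on made-up rating vectors (my own illustration, not taken from the data):

real      <- c(4, 3, 5, 2)    # hypothetical true ratings
predicted <- c(3.5, 3, 4, 3)  # hypothetical predicted ratings
errors <- real - predicted
c(RMSE = sqrt(mean(errors^2)),  # root mean squared error
  MSE  = mean(errors^2),        # mean squared error
  MAE  = mean(abs(errors)))     # mean absolute error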

eval_accuracy <- calcPredictionAccuracy( x = eval_prediction, data = getData(evaluation_set, "unknown"),byUser = TRUE)
head(eval_accuracy)
##         RMSE       MSE       MAE
## 1  2.1196546 4.4929355 1.8201409
## 6  1.6514825 2.7273943 1.3132332
## 10 0.5745401 0.3300964 0.3755680
## 22 2.0831144 4.3393655 1.7038808
## 23 1.2751736 1.6260677 0.9755292
## 26 1.5214371 2.3147708 1.2197639
qplot(eval_accuracy[,"RMSE"])+
  geom_histogram(binwidth = 0.2)+
  ggtitle("Distribution of the RMSE by user")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Most of the per-user RMSE values fall between 1.0 and 2.0, with a peak around 1.2.

Evaluating the recommendations. We compare the recommendations against the items with a positive rating. We have defined ratings of 3 and above as positive and ratings below 3 as negative.

results <- evaluate(x = evaluation_set,method = model_to_evaluate, n = seq(10,50,10))
## IBCF run fold/sample [model time/prediction time]
##   1  [0.925sec/0.061sec]
head(getConfusionMatrix(results)[[1]])
##           TP        FP       FN       TN precision     recall        TPR
## 10  2.666667  7.333333 98.10185 432.8981 0.2666667 0.02610100 0.02610100
## 20  5.342593 14.657407 95.42593 425.5741 0.2671296 0.05382181 0.05382181
## 30  7.601852 22.398148 93.16667 417.8333 0.2533951 0.07666000 0.07666000
## 40  9.916667 30.083333 90.85185 410.1481 0.2479167 0.09926430 0.09926430
## 50 11.768519 38.231481 89.00000 402.0000 0.2353704 0.11912977 0.11912977
##           FPR
## 10 0.01649585
## 20 0.03301149
## 30 0.05058133
## 40 0.06790812
## 50 0.08650725

TP: recommended items that have been purchased. FP: recommended items that haven’t been purchased. FN: not recommended items that have been purchased. TN: not recommended items that haven’t been purchased. A perfect model would have only TP and TN.

The ROC curve plots the true positive rate against the false positive rate. True positive rate: the percentage of purchased items that have been recommended, TP rate = TP / (TP + FN). False positive rate: the percentage of not-purchased items that have been recommended, FP rate = FP / (FP + TN).

plot(results, annotate = TRUE, main = "ROC curve")

Precision-recall curve with the rate of precision and recall.

Precision: the percentage of recommended items that have been purchased, precision = TP / (TP + FP). Recall: the percentage of purchased items that have been recommended, recall = TP / (TP + FN), which equals the true positive rate.
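
As a check (my own addition), we can recompute these rates from the raw TP/FP/FN/TN columns of the confusion matrix above. Because the table averages each cell across users, these ratios of averages will be close to, but not exactly equal to, the averaged per-user precision and recall it reports.

cm <- getConfusionMatrix(results)[[1]]
cbind(precision = cm[, "TP"] / (cm[, "TP"] + cm[, "FP"]),
      recall    = cm[, "TP"] / (cm[, "TP"] + cm[, "FN"]),
      FPR       = cm[, "FP"] / (cm[, "FP"] + cm[, "TN"]))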

plot(results,"prec/rec", annotate = TRUE, main="Precision-recall")

models_to_evaluate <- list(
  IBCF_cos = list(name = "IBCF", param = list(method = "cosine")),
  IBCF_cor = list(name = "IBCF", param= list (method = "pearson")),
  UBCF_cos = list(name = "UBCF", param = list(method = "cosine")),
  UBCF_cor = list(name = "UBCF", param = list(method = "pearson")),
  random=list(name="RANDOM", param =NULL)
)
n_recommendations <- c (1,5, seq(10,50,10))
list_results <- evaluate(x = evaluation_set, method = models_to_evaluate, n = n_recommendations)
## IBCF run fold/sample [model time/prediction time]
##   1  [0.868sec/0.077sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.93sec/0.057sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0.004sec/0.159sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0.003sec/0.177sec] 
## RANDOM run fold/sample [model time/prediction time]
##   1  [0.001sec/0.047sec]
plot(list_results,annotate = 1, legend = "topleft") 

The highest AUC (area under the ROC curve) comes from UBCF with Pearson correlation, so in our case UBCF_cor is the best-performing technique.
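
As a follow-up sketch (my own addition): since UBCF with Pearson correlation wins on AUC, we can also look at its rating-prediction error on the same split for comparison with the IBCF numbers above.

ubcf_rec <- Recommender(data = getData(evaluation_set, "train"),
                        method = "UBCF",
                        parameter = list(method = "pearson"))
ubcf_pred <- predict(ubcf_rec, getData(evaluation_set, "known"), type = "ratings")
calcPredictionAccuracy(ubcf_pred, getData(evaluation_set, "unknown"))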

Optimizing k-value.

vector_k <- c(5,10,20,30,40)
models_to_evaluate <- lapply(vector_k, function(k){
  list(name="IBCF",param = list(method = "cosine",k = k))
})
names(models_to_evaluate) <- paste0("IBCF_K_", vector_k)
n_recommendations <- c (1,5, seq(10,50,10))
list_results <- evaluate(x = evaluation_set, method = models_to_evaluate, n = n_recommendations)
## IBCF run fold/sample [model time/prediction time]
##   1  [0.877sec/0.027sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.851sec/0.032sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.842sec/0.044sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.848sec/0.051sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.727sec/0.061sec]
plot(list_results,annotate = 1, legend = "topleft") 

plot(list_results,"prec/rec",annotate = 1, legend="bottomright")

The k with the largest AUC is 40, followed by 30. If we want to favor precision, k = 20 is the better choice.
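
Once k is chosen (here k = 40 for the largest AUC), the final IBCF model can be retrained with that value; this is a sketch of my own, not part of the textbook example.

ibcf_final <- Recommender(data = getData(evaluation_set, "train"),
                          method = "IBCF",
                          parameter = list(method = "cosine", k = 40))
# top-5 recommendations for the test users
predict(ibcf_final, getData(evaluation_set, "known"), n = items_to_recommend)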

Now, I run the SVD, SVDF, and ALS algorithms for the recommender system.

# Create the recommender based on SVD, SVDF, and ALS using the training data
r.svd <- Recommender(getData(evaluation_set, "train"),"SVD")
r.svdf <- Recommender(getData(evaluation_set, "train"),"SVDF")
r.als <- Recommender(getData(evaluation_set, "train"),"ALS")

#Compute predicted ratings for the test data.
p.svd <- predict(r.svd, getData(evaluation_set,"known"),type="ratings")
p.svdf <- predict(r.svdf,getData(evaluation_set,"known"),type="ratings")
p.als <- predict(r.als, getData(evaluation_set,"known"),type="ratings")

error <- rbind(svd = calcPredictionAccuracy(p.svd,getData(evaluation_set, "unknown")),
               svdf = calcPredictionAccuracy(p.svdf,getData(evaluation_set,"unknown")),
               als = calcPredictionAccuracy(p.als, getData(evaluation_set,"unknown")))
error
##           RMSE       MSE       MAE
## svd  0.9705659 0.9419982 0.7741054
## svdf 0.8741273 0.7640986 0.6842043
## als  0.8766665 0.7685441 0.6957556

From this error table, we find that SVDF has the smallest RMSE, MSE, and MAE.
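
For reference (my own addition), the aggregate errors of the earlier IBCF model on the same split can be computed the same way and read against the rows above.

calcPredictionAccuracy(eval_prediction, getData(evaluation_set, "unknown"), byUser = FALSE)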

# Note: as the warnings below show, "method" and "type" are not valid parameters here, so each model runs with its default settings.
models_to_evaluate <- list(
  svd = list(name = "svd", param = list(method = "SVD",type="topNList")),
  svdf = list(name = "svdf", param= list (method = "SVDF",type="topNList")),
  als = list(name = "als", param = list(method = "ALS",type="topNList"))
)

n_recommendations <- c (1,5, seq(10,50,10))
list_results <- evaluate(x = evaluation_set, method = models_to_evaluate, n = n_recommendations)
## svd run fold/sample [model time/prediction time]
##   1
## Warning: Unknown parameters: method, type
## Available parameter (with default values):
## k     =  10
## maxiter   =  100
## normalize     =  center
## verbose   =  FALSE
## [0.031sec/0.044sec] 
## svdf run fold/sample [model time/prediction time]
##   1
## Warning: Unknown parameters: method, type
## Available parameter (with default values):
## k     =  10
## gamma     =  0.015
## lambda    =  0.001
## min_epochs    =  50
## max_epochs    =  200
## min_improvement   =  1e-06
## normalize     =  center
## verbose   =  FALSE
## [14.761sec/5.33sec] 
## als run fold/sample [model time/prediction time]
##   1
## Warning: Unknown parameters: method, type
## Available parameter (with default values):
## normalize     =  NULL
## lambda    =  0.1
## n_factors     =  10
## n_iterations  =  10
## min_item_nr   =  1
## seed  =  NULL
## verbose   =  FALSE
## [0.001sec/7.645sec]
plot(list_results,annotate = 1, legend = "topleft") 

# Note: "method" and "type" again trigger "Unknown parameters" warnings below; only k is a valid SVD parameter, so this sweep varies k alone.
vector_k <- c(5,10,20,30,40)
models_to_evaluate <- lapply(vector_k, function(k){
  list(name="SVD",param = list(method = "SVD",k = k,type="topNList"))
})
names(models_to_evaluate) <- paste0("SVD_K_", vector_k)
n_recommendations <- c (1,5, seq(10,50,10))
list_results <- evaluate(x = evaluation_set, method = models_to_evaluate, n = n_recommendations)
## SVD run fold/sample [model time/prediction time]
##   1
## Warning: Unknown parameters: method, type
## Available parameter (with default values):
## k     =  10
## maxiter   =  100
## normalize     =  center
## verbose   =  FALSE
## [0.023sec/0.183sec] 
## SVD run fold/sample [model time/prediction time]
##   1
## Warning: Unknown parameters: method, type
## Available parameter (with default values):
## k     =  10
## maxiter   =  100
## normalize     =  center
## verbose   =  FALSE
## [0.029sec/0.041sec] 
## SVD run fold/sample [model time/prediction time]
##   1
## Warning: Unknown parameters: method, type
## Available parameter (with default values):
## k     =  10
## maxiter   =  100
## normalize     =  center
## verbose   =  FALSE
## [0.046sec/0.044sec] 
## SVD run fold/sample [model time/prediction time]
##   1
## Warning: Unknown parameters: method, type
## Available parameter (with default values):
## k     =  10
## maxiter   =  100
## normalize     =  center
## verbose   =  FALSE
## [0.076sec/0.04sec] 
## SVD run fold/sample [model time/prediction time]
##   1
## Warning: Unknown parameters: method, type
## Available parameter (with default values):
## k     =  10
## maxiter   =  100
## normalize     =  center
## verbose   =  FALSE
## [0.102sec/0.041sec]
plot(list_results,annotate = 1, legend = "topleft") 

plot(list_results,"prec/rec",annotate = 1, legend="bottomright")

However, from the top-N comparison, SVD has the highest AUC. In the k sweep for SVD, k = 10 and k = 5 are very close, but k = 10 is slightly higher, so choosing k = 10 based on AUC seems a good choice; on the precision-recall curve, k = 10 is still the better option.
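
As a final sketch (my own addition), SVD can be retrained with the chosen k = 10 (which is also the default) to produce top-5 recommendation lists for the test users.

svd_final <- Recommender(getData(evaluation_set, "train"), "SVD",
                         parameter = list(k = 10))
svd_top5 <- predict(svd_final, getData(evaluation_set, "known"), n = 5)
# inspect the lists for the first few test users
as(svd_top5, "list")[1:3]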

Recommender systems can be evaluated offline or online. Searching online, I found this article: http://ceur-ws.org/Vol-1609/16090642.pdf. It mentions a method called A/B testing, in which one group of users is served by recommender system A and another group by recommender system B. Online evaluation can reduce the transaction costs of finding and selecting items online. Collaborative filtering is still the most well-known and most commonly implemented technique. As for SVD, based on the comparison between SVD and collaborative filtering in this project, I think SVD could be very time-consuming, especially when dealing with massive data online.