DATA 643 Project 2 - Content-Based and Collaborative Filtering

Data Loading and Preparation
Recommenderlab data preprocessing
User-user Collaborative Filtering
Item-item Collaborative Filtering
Model application
Discussion
Supportive links

The following demonstration is a film recommender system that helps users find new movies based on the user-to-movie ratings in the MovieLens dataset. It covers two techniques, user-user and item-item collaborative filtering, and uses the recommenderlab library for model training and prediction.

Data Loading and Preparation

# Data loading
library(recommenderlab)
data(MovieLense)
hist(getRatings(MovieLense), main="Distribution of ratings", breaks=6)

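# Data pre-processing: convert to a long data frame (user, item, rating),
# then turn the user and item factors into integer codes usable as matrix indices
movies <- as(MovieLense, 'data.frame')
movies$user <- as.numeric(movies$user)
movies$item <- as.numeric(movies$item)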

Recommenderlab data preprocessing

To work with the recommenderlab library, the movies data frame must be converted into a sparse matrix and then into a “realRatingMatrix”. Bear in mind that model training and prediction are memory-intensive, so a larger ratings matrix may exceed what R can hold in memory, as was the case with alternative datasets.

sparse_ratings <- sparseMatrix(i = movies$user, j = movies$item, x = movies$rating, 
                               dims = c(length(unique(movies$user)), length(unique(movies$item))),  
                               dimnames = list(paste("u", 1:length(unique(movies$user)), sep = ""), 
                                               paste("m", 1:length(unique(movies$item)), sep = "")))

real_ratings <- new("realRatingMatrix", data = sparse_ratings)
real_ratings

## 943 x 1664 rating matrix of class 'realRatingMatrix' with 99392 ratings.
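The conversion above can be tried on a toy example. This is a minimal sketch, assuming a made-up three-user data frame (`toy` and its contents are hypothetical and not part of MovieLense); it shows why the factor columns are turned into integer codes before they are passed to `sparseMatrix()`.

```r
library(Matrix)  # provides sparseMatrix()

# Hypothetical toy ratings in long format, mimicking the shape of `movies`
toy <- data.frame(
  user   = factor(c("Ann", "Ann", "Bob", "Cy")),
  item   = factor(c("Alien", "Brazil", "Alien", "Casablanca")),
  rating = c(5, 3, 4, 2)
)

# as.numeric() on a factor returns its integer level codes (1, 2, 3, ...),
# which is what makes the columns usable as row/column indices
toy$user <- as.numeric(toy$user)
toy$item <- as.numeric(toy$item)

toy_sparse <- sparseMatrix(
  i = toy$user, j = toy$item, x = toy$rating,
  dims = c(length(unique(toy$user)), length(unique(toy$item))),
  dimnames = list(paste0("u", seq_along(unique(toy$user))),
                  paste0("m", seq_along(unique(toy$item))))
)

# Dense view for inspection only; cells without a rating show up as 0 here
toy_dense <- as.matrix(toy_sparse)
toy_dense
```

Only the four observed ratings are stored in the sparse object, which is what keeps the full 943 x 1664 matrix manageable.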

User-user Collaborative Filtering

The recommenderlab similarity() function takes the “realRatingMatrix” and computes a similarity matrix, here using cosine similarity, which helps in exploring the data during model development. The which argument switches the calculation between users and items.

#similarity matrix
similarity_users <- similarity(real_ratings[1:25, ], method =  "cosine", which = "users") 
image(as.matrix(similarity_users), main = "User similarity")
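To make the cosine calculation concrete, here is a minimal base R sketch on a hypothetical 3 x 4 rating matrix. It uses the textbook formula restricted to items both users have rated; `toy_ratings` and `cosine_sim` are names made up for illustration, and recommenderlab's own handling of missing ratings may differ in detail.

```r
# Hypothetical 3-user x 4-item rating matrix; NA marks an unrated movie
toy_ratings <- matrix(
  c(5, 3, NA, 1,
    4, NA, 2, 1,
    1, 1, 5, NA),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("u", 1:3), paste0("m", 1:4))
)

# Cosine similarity of two rating vectors, using only co-rated items
cosine_sim <- function(a, b) {
  both <- !is.na(a) & !is.na(b)
  if (!any(both)) return(NA_real_)
  sum(a[both] * b[both]) / (sqrt(sum(a[both]^2)) * sqrt(sum(b[both]^2)))
}

# Pairwise user-user similarity matrix
n <- nrow(toy_ratings)
sim <- matrix(NA_real_, n, n,
              dimnames = list(rownames(toy_ratings), rownames(toy_ratings)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    sim[i, j] <- cosine_sim(toy_ratings[i, ], toy_ratings[j, ])
  }
}
round(sim, 3)
```

Applying the same function to columns instead of rows gives item-item similarities.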

We construct the user-user model and validate it with a recommenderlab evaluationScheme object that uses the “split” method: 80% of the users go to the training set and, for each test user, all but 5 ratings are given to the model (given = -5) while the remaining 5 are withheld. The withheld ratings are used to compute the root mean square error (RMSE) of the predictions. A confusion matrix is an alternative way to validate the model. A “Recommender” object is then created with the “UBCF” (user-based collaborative filtering) method, using center normalization, cosine similarity, and 25 nearest neighbors.

#Evaluation object for RMSE checking.
set.seed(1)
e <- evaluationScheme(real_ratings, method="split", train=0.8, given=-5)

# Creation of the model - U(ser) B(ased) C(ollaborative) F(iltering)
Rec.model <- Recommender(real_ratings, method = "UBCF", 
                     param=list(normalize = "center", method="Cosine", nn=25))

#Making predictions 
prediction <- predict(Rec.model, real_ratings[1:25, ], type="ratings")
as(prediction, "matrix")[,1:5]

##           m1       m2       m3       m4       m5
## u1  3.605166 3.605166       NA       NA 3.562997
## u2  4.206522 4.206522 4.223339       NA 4.152807
## u3  3.107143 3.107143 3.061856 3.107143 3.093476
## u4  2.895522 2.895522       NA 2.881361 2.882393
## u5  2.615741 2.615741 2.383553 2.823319 2.615741
## u6  3.620690 3.620690 3.600638 3.752644 3.620690
## u7  2.663636 2.663636 2.719601 2.692688       NA
## u8  3.363636 3.363636 3.363636 3.363636 3.396348
## u9  3.765625 3.765625 3.654908 3.822140 3.765625
## u10 2.952381 2.952381 2.949733 2.952381 2.945525
## u11 3.575758 3.575758 3.505602 3.615067 3.555972
## u12 3.418998 3.440171 3.074245       NA 3.440171
## u13 3.455556 3.455556 3.419717 3.475778 3.455556
## u14 3.045113 3.045113 3.032400 3.078071 3.045113
## u15 3.565217 3.565217 3.565217 3.565217 3.565217
## u16 3.661331 3.688889 3.688889 3.691235 3.679981
## u17 3.693878 3.693878 3.681136 3.685483 3.707989
## u18 3.604167 3.604167 3.493174 3.655711 3.604167
## u19 3.934783 3.934783 3.818656       NA 3.914433
## u20 3.007143 3.005200 3.007143 3.096234 3.007143
## u21 3.905882 3.905882 3.937660 3.937755 3.905882
## u22 4.661972 4.661972 4.484950 4.764733 4.687777
## u23 3.949721 3.949721 3.877516 3.961044 3.967266
## u24 4.392157 4.390376 4.400138 4.485823 4.392157
## u25 3.505860 3.576923 3.467787 3.576923 3.576923
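The idea behind a user-based prediction can be sketched in a few lines of base R. This is an illustrative simplification, not recommenderlab's implementation: it centers each user's ratings, finds the most similar users who rated the target movie, and adds their weighted average deviation to the target user's mean. The matrix `toy_ratings`, the function `predict_ubcf`, and the choice to drop non-positive similarities are all assumptions made for the example.

```r
# Hypothetical 4-user x 4-item rating matrix; NA marks an unrated movie
toy_ratings <- matrix(
  c(5, 4, NA, 1,
    4, 5, 2, 1,
    1, 2, 5, 4,
    5, 5, 1, NA),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("u", 1:4), paste0("m", 1:4))
)

predict_ubcf <- function(ratings, user, item, nn = 2) {
  user_means <- rowMeans(ratings, na.rm = TRUE)
  centered <- ratings - user_means  # "center" normalization, row by row

  cosine_sim <- function(a, b) {
    both <- !is.na(a) & !is.na(b)
    if (!any(both)) return(NA_real_)
    denom <- sqrt(sum(a[both]^2)) * sqrt(sum(b[both]^2))
    if (denom == 0) return(NA_real_)
    sum(a[both] * b[both]) / denom
  }

  # Candidate neighbours: other users who have rated the target item
  others <- setdiff(rownames(ratings), user)
  others <- others[!is.na(ratings[others, item])]
  sims <- sapply(others, function(o) cosine_sim(centered[user, ], centered[o, ]))
  sims <- sims[!is.na(sims) & sims > 0]  # keep positively similar users only
  if (length(sims) == 0) return(NA_real_)

  top <- head(sort(sims, decreasing = TRUE), nn)
  # Weighted average of neighbours' centered ratings, shifted back by the user's mean
  user_means[[user]] + sum(top * centered[names(top), item]) / sum(top)
}

p_u1_m3 <- predict_ubcf(toy_ratings, user = "u1", item = "m3")
p_u1_m3
```

In this toy data the two users most similar to u1 both rated m3 below their own averages, so the prediction lands below u1's mean rating.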

#Estimating RMSE
set.seed(1)

RMSE.model <- Recommender(getData(e, "train"), method = "UBCF", 
                     param=list(normalize = "center", method="Cosine", nn=25))

prediction <- predict(RMSE.model, getData(e, "known"), type="ratings")

rmse_ubcf <- calcPredictionAccuracy(prediction, getData(e, "unknown"))[1]
rmse_ubcf

##     RMSE 
## 1.031304
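For reference, RMSE is the square root of the mean squared difference between predicted and withheld ratings. A minimal sketch with made-up numbers (`actual`, `predicted`, and `rmse` are hypothetical names, not recommenderlab objects):

```r
# Hypothetical withheld ratings and the model's predictions for the same cells
actual    <- c(4, 3, 5, 2, 1)
predicted <- c(3.5, 3.0, 4.0, 2.5, NA)  # NA: the model could not score this cell

rmse <- function(pred, truth) {
  ok <- !is.na(pred) & !is.na(truth)    # score only cells that have both values
  sqrt(mean((pred[ok] - truth[ok])^2))
}

toy_rmse <- rmse(predicted, actual)
toy_rmse
```

Because the errors are squared before averaging, a single prediction that is off by a full star weighs more than several that are off by half a star.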

Item-item Collaborative Filtering

The only changes from the user-user approach are the “IBCF” method argument and the neighborhood parameter, which becomes k (the number of most similar items kept for each item, 350 here) in place of nn.

#Building model
model <- Recommender(real_ratings, method = "IBCF", 
                     param=list(normalize = "center", method="Cosine", k=350))

#Making predictions 
prediction <- predict(model, real_ratings[1:25, ], type="ratings")
as(prediction, "matrix")[,1:5]

##           m1       m2       m3       m4       m5
## u1  3.606855 3.432844       NA       NA 3.364729
## u2  4.274994 4.100377 4.453378       NA 4.346925
## u3  2.457169       NA 4.000000 2.707015 2.816183
## u4  2.625211 4.000000       NA 3.121320 3.043051
## u5  2.643749 2.661201 2.777874 2.767238 2.758360
## u6  3.473926 3.670153 3.547413 3.690713 3.955558
## u7  2.458963 5.000000 2.912939 3.170524       NA
## u8  4.267140       NA 3.554982       NA 3.692531
## u9  4.113253 3.914352 3.925738 4.172249 3.895860
## u10 2.695618       NA 2.561915 1.339196 2.675744
## u11 3.835681 3.000000 3.521053 3.880685 3.300969
## u12 3.445636 3.797407 3.653746       NA 3.276212
## u13 3.351456 3.619745 3.588417 3.458750 3.669085
## u14 3.066369 3.021650 3.087120 2.997720 3.102580
## u15 3.397783       NA 4.000000       NA 3.598078
## u16 4.122097       NA 3.460183 3.408894 3.913976
## u17 3.697942 4.000000 3.780144 3.800102 3.465968
## u18 3.571328 3.476977 3.862008 3.712382 3.000000
## u19 3.897629 3.951641 4.386356       NA 3.636076
## u20 2.845824 3.130143 3.314887 3.604163 3.037190
## u21 3.674012 4.476616 4.041865 4.152875 4.166474
## u22 4.737747 4.635225 4.616880 4.868661 4.853315
## u23 3.939223 4.316042 3.964131 4.196925 3.742027
## u24 4.733500 4.498465 4.486323 4.512190 4.611544
## u25 3.586579       NA 4.040942 3.454709 3.308393
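An item-based prediction can be sketched the same way, with the roles of rows and columns swapped. This is a simplified illustration under stated assumptions, not recommenderlab's implementation: it skips centering, computes cosine similarity between movie columns, keeps the k most similar movies, and averages the user's own ratings on those movies. `toy_ratings` and `predict_ibcf` are hypothetical names.

```r
# Hypothetical 4-user x 4-item rating matrix; NA marks an unrated movie
toy_ratings <- matrix(
  c(5, 4, NA, 1,
    4, 5, 2, 1,
    1, 2, 5, 4,
    5, 5, 1, NA),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("u", 1:4), paste0("m", 1:4))
)

predict_ibcf <- function(ratings, user, item, k = 2) {
  cosine_sim <- function(a, b) {
    both <- !is.na(a) & !is.na(b)
    if (!any(both)) return(NA_real_)
    sum(a[both] * b[both]) / (sqrt(sum(a[both]^2)) * sqrt(sum(b[both]^2)))
  }

  # Similarity of the target movie to every other movie (column-wise)
  others <- setdiff(colnames(ratings), item)
  sims <- sapply(others, function(o) cosine_sim(ratings[, item], ratings[, o]))
  sims <- sims[!is.na(sims)]
  top <- head(sort(sims, decreasing = TRUE), k)  # the k most similar movies

  # Only neighbours the user has actually rated can contribute
  rated <- names(top)[!is.na(ratings[user, names(top)])]
  if (length(rated) == 0) return(NA_real_)
  sum(top[rated] * ratings[user, rated]) / sum(top[rated])
}

q_u1_m3 <- predict_ibcf(toy_ratings, user = "u1", item = "m3")
q_u1_m3
```

In this sketch a prediction is NA whenever the user has rated none of the k neighbouring movies, and it equals a single existing rating when only one neighbour has been rated.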

#Estimating RMSE
set.seed(1)

model <- Recommender(getData(e, "train"), method = "IBCF", 
                     param=list(normalize = "center", method="Cosine",k=350))

prediction <- predict(model, getData(e, "known"), type="ratings")

rmse_ibcf <- calcPredictionAccuracy(prediction, getData(e, "unknown"))[1]
rmse_ibcf

##     RMSE 
## 1.061466

Model application

Based on the RMSE values, the user-based model (1.031) performs slightly better than the item-based model (1.061) on this split. Let’s take an example user, “Bob”, who is user 610.

real_ratings[610,]

## 1 x 1664 rating matrix of class 'realRatingMatrix' with 295 ratings.

The top 5 recommendations for Bob from the user-based model are:

recommended.items.u610 <- predict(Rec.model, real_ratings[610,], n=5)
as(recommended.items.u610, "list")

## $u610
## [1] "m336"  "m1319" "m500"  "m211"  "m306"
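Conceptually, a top-N list is the set of highest predicted ratings among the movies the user has not yet rated. A minimal sketch with made-up scores (`predicted` and `top_n` are hypothetical and unrelated to Bob's actual predictions):

```r
# Hypothetical predicted ratings for one user; NA marks movies already rated
predicted <- c(m1 = 3.2, m2 = NA, m3 = 4.7, m4 = 4.1, m5 = NA, m6 = 2.9, m7 = 4.4)

top_n <- function(pred, n = 5) {
  candidates <- pred[!is.na(pred)]                      # drop already-rated movies
  names(head(sort(candidates, decreasing = TRUE), n))   # best n by predicted rating
}

top3 <- top_n(predicted, n = 3)
top3
```

The item labels returned by the model, such as "m336", are the column names built during preprocessing, so they refer to column positions in the rating matrix rather than movie titles.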

Discussion

This submission provides only a template for a basic collaborative filtering approach and an initial “mapping” of a route toward a better recommendation model. A hybrid approach of some kind could bolster this method; however, a more rigorous optimization of the item-item and user-user methods would be the wiser next step. Many of the available parameters have yet to be tested, both in the recommenderlab R package generally and in this particular system.

Supportive links

https://rpubs.com/tarashnot/recommender_comparison

https://ashokharnal.wordpress.com/2014/12/18/using-recommenderlab-for-predicting-ratings-for-movielens-data/