library("recommenderlab")

Binary Rating Matrix

In some cases customers rate items simply as “Good” or “Not Good”, “Like” or “Not Like”, so we are dealing with binary data. Below I create a 5x10 binary matrix.

## create a 0-1 matrix
m <- matrix(sample(c(0,1), 50, replace=TRUE), nrow=5, ncol=10,
dimnames=list(users=paste("u", 1:5, sep=''),
items=paste("i", 1:10, sep='')))
m
##      items
## users i1 i2 i3 i4 i5 i6 i7 i8 i9 i10
##    u1  1  1  0  1  1  1  0  1  1   1
##    u2  0  0  0  1  0  0  0  0  0   0
##    u3  0  1  0  0  1  1  1  0  1   1
##    u4  0  0  1  1  1  0  0  1  0   1
##    u5  1  1  0  0  0  1  0  0  0   0
## coerce it into a binaryRatingMatrix
b <- as(m, "binaryRatingMatrix")
b
## 5 x 10 rating matrix of class 'binaryRatingMatrix' with 23 ratings.
## coerce it back to see if it worked
as(b, "matrix")
##       i1    i2    i3    i4    i5    i6    i7    i8    i9   i10
## u1  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
## u2 FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## u3 FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
## u4 FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE
## u5  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## use some methods defined in ratingMatrix
dim(b)
## [1]  5 10
dimnames(b)
## [[1]]
## [1] "u1" "u2" "u3" "u4" "u5"
## 
## [[2]]
##  [1] "i1"  "i2"  "i3"  "i4"  "i5"  "i6"  "i7"  "i8"  "i9"  "i10"
## counts of Rows and Columns
rowCounts(b)
## u1 u2 u3 u4 u5 
##  8  1  6  5  3
colCounts(b)
##  i1  i2  i3  i4  i5  i6  i7  i8  i9 i10 
##   2   3   1   3   3   3   1   2   2   3
## Plot - shows in grey the items rated 1 by the users
image(b)

## sample and subset
sample(b,2)
## 2 x 10 rating matrix of class 'binaryRatingMatrix' with 8 ratings.
b[1:2,1:5]
## 2 x 5 rating matrix of class 'binaryRatingMatrix' with 5 ratings.
## coercion
as(b, "list")
## $u1
## [1] "i1"  "i2"  "i4"  "i5"  "i6"  "i8"  "i9"  "i10"
## 
## $u2
## [1] "i4"
## 
## $u3
## [1] "i2"  "i5"  "i6"  "i7"  "i9"  "i10"
## 
## $u4
## [1] "i3"  "i4"  "i5"  "i8"  "i10"
## 
## $u5
## [1] "i1" "i2" "i6"
head(as(b, "data.frame"))
##    user item rating
## 1    u1   i1      1
## 3    u1   i2      1
## 7    u1   i4      1
## 10   u1   i5      1
## 13   u1   i6      1
## 17   u1   i8      1
head(getData.frame(b, ratings=FALSE))
##    user item
## 1    u1   i1
## 3    u1   i2
## 7    u1   i4
## 10   u1   i5
## 13   u1   i6
## 17   u1   i8
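
A binary matrix does not have to be built by hand; an existing realRatingMatrix can be thresholded with binarize(). A minimal sketch on a random toy matrix (the cutoff of 4 is my choice, just for illustration):

## toy real-valued ratings 1-5 with some missing values
mm <- matrix(sample(c(NA, 1:5), 50, replace=TRUE), nrow=5, ncol=10,
dimnames=list(users=paste("u", 1:5, sep=''),
items=paste("i", 1:10, sep='')))
rr <- as(mm, "realRatingMatrix")
## ratings of 4 and above become 1, everything else becomes 0
binarize(rr, minRating=4)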

Calculate the Prediction Error for a Recommendation

Calculate prediction accuracy. For predicted ratings, MAE (mean absolute error), MSE (mean squared error) and RMSE (root mean squared error) are calculated. For topNLists various binary classification metrics are returned.

Below I give an example using Jester5k, which contains a 5000 x 100 rating matrix (5000 users and 100 jokes) with ratings between -10.00 and +10.00. All selected users have rated 36 or more jokes. Clearly there are many NA values.
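
Before building a recommender it is worth quantifying those missing values; a quick sketch of the share of unobserved user/item pairs (roughly 28%, given the 362,106 ratings in the 5000 x 100 matrix):

data(Jester5k)
## fraction of user/item pairs without a rating
1 - nratings(Jester5k) / prod(dim(Jester5k))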

### real valued recommender
data(Jester5k)


## create a 90/10 train/test split for the first 500 users in Jester5k
## (each test user keeps given=15 items as 'known', the rest as 'unknown')
e <- evaluationScheme(Jester5k[1:500,], method="split", train=0.9,
k=1, given=15)
e
## Evaluation scheme with 15 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.900
## Good ratings: NA
## Data set: 500 x 100 rating matrix of class 'realRatingMatrix' with 36345 ratings.
## create a user-based CF recommender using training data
r <- Recommender(getData(e, "train"), "UBCF")

## create predictions for the test data using known ratings (see given above)
p <- predict(r, getData(e, "known"), type="ratings")
p
## 50 x 100 rating matrix of class 'realRatingMatrix' with 4250 ratings.
## compute error metrics averaged per user and then averaged over all
## recommendations
calcPredictionAccuracy(p, getData(e, "unknown"))
##      RMSE       MSE       MAE 
##  4.409095 19.440119  3.459598
head(calcPredictionAccuracy(p, getData(e, "unknown"), byUser=TRUE))
##            RMSE       MSE      MAE
## u5021  2.520403  6.352429 2.027328
## u20139 5.981527 35.778670 5.733758
## u21555 5.125833 26.274169 4.132931
## u1044  2.074061  4.301729 1.551694
## u7905  5.160507 26.630829 4.464032
## u1287  3.951751 15.616338 3.330400
## evaluate topNLists instead (you need to specify given and goodRating!)
p <- predict(r, getData(e, "known"), type="topNList")
p
## Recommendations as 'topNList' with n = 10 for 50 users.
calcPredictionAccuracy(p, getData(e, "unknown"), given=15, goodRating=5)
##          TP          FP          FN          TN   precision      recall 
##  4.30000000  5.70000000 13.14000000 61.86000000  0.43000000  0.30410138 
##         TPR         FPR 
##  0.30410138  0.07972832
## evaluate a binary recommender
data(MSWeb)
MSWeb10 <- sample(MSWeb[rowCounts(MSWeb) >10,], 50)

e <- evaluationScheme(MSWeb10, method="split", train=0.9,
k=1, given=3)
e
## Evaluation scheme with 3 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.900
## Good ratings: NA
## Data set: 50 x 285 rating matrix of class 'binaryRatingMatrix' with 656 ratings.
## create a user-based CF recommender using training data
r <- Recommender(getData(e, "train"), "UBCF")
## create predictions for the test data using known ratings (see given above)
p <- predict(r, getData(e, "known"), type="topNList", n=10)
p
## Recommendations as 'topNList' with n = 10 for 5 users.
calcPredictionAccuracy(p, getData(e, "unknown"), given=3)
##           TP           FP           FN           TN    precision 
##   4.40000000   5.60000000   5.80000000 266.20000000   0.44000000 
##       recall          TPR          FPR 
##   0.43787879   0.43787879   0.02060226
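
The classification columns in these tables are computed per user from that user's TP/FP/FN/TN counts and then averaged. Because every top-N list here has exactly n = 10 entries, the per-user precision denominator TP + FP is fixed, so the averaged precision can be read straight off the averaged counts; recall, TPR and FPR use user-specific denominators and cannot. A quick check against the numbers above:

4.3 / (4.3 + 5.7)   # Jester5k topNList example: precision 0.43
4.4 / (4.4 + 5.6)   # binary MSWeb example: precision 0.44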

Similarity and Dissimilarity

Similarities are computed from dissimilarities using s = 1/(1+d) or s = 1-d, depending on the measure.

data(MSWeb)
## between 5 users
dissimilarity(MSWeb[1:5,], method = "jaccard")
##           1         2         3         4
## 2 0.7500000                              
## 3 0.8000000 0.3333333                    
## 4 1.0000000 1.0000000 1.0000000          
## 5 1.0000000 1.0000000 1.0000000 1.0000000
similarity(MSWeb[1:5,], method = "jaccard")
##           1         2         3         4
## 2 0.2500000                              
## 3 0.2000000 0.6666667                    
## 4 0.0000000 0.0000000 0.0000000          
## 5 0.0000000 0.0000000 0.0000000 0.0000000
## between first 3 items
dissimilarity(MSWeb[,1:3], method = "jaccard", which = "items")
##                           regwiz Support Desktop
## Support Desktop        0.9407466                
## End User Produced View 0.9753239       0.9533011
similarity(MSWeb[,1:3], method = "jaccard", which = "items")
##                            regwiz Support Desktop
## Support Desktop        0.05925341                
## End User Produced View 0.02467613      0.04669887
## cross-similarity between first 2 users and users 10-20
similarity(MSWeb[1:2,], MSWeb[10:20,], method="jaccard")
## An object of class "ar_cross_dissimilarity"
##      10 11 12 13 14 15 16 17 18         19  20
## 1 0.125  0  0  0  0  0  0  0  0 0.07692308 0.4
## 2 0.000  0  0  0  0  0  0  0  0 0.08333333 0.2
## Slot "method":
## [1] "jaccard"

Error Calculation

Calculate the mean absolute error (MAE), mean square error (MSE), root mean square error (RMSE) and for matrices also the Frobenius norm (identical to RMSE).

true <- rnorm(10)
predicted <- rnorm(10)
MAE(true, predicted)
## [1] 1.191958
MSE(true, predicted)
## [1] 1.984111
RMSE(true, predicted)
## [1] 1.408585
true <- matrix(rnorm(9), nrow = 3)
predicted <- matrix(rnorm(9), nrow = 3)
frobenius(true, predicted)
## [1] 0.9994234
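
These helpers boil down to one-line formulas; a minimal by-hand check using the true and predicted matrices defined just above (the package treats frobenius() as identical to RMSE for matrices):

err <- true - predicted
mean(abs(err))       # MAE
mean(err^2)          # MSE
sqrt(mean(err^2))    # RMSE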

Evaluate a Recommender Model

### evaluate top-N list recommendations on a 0-1 data set
data("MSWeb")
MSWeb10 <- sample(MSWeb[rowCounts(MSWeb) >10,], 20)
## create an evaluation scheme
es <- evaluationScheme(MSWeb10, method="cross-validation",
k=4, given=3)
## run evaluation
ev <- evaluate(es, "POPULAR", n=c(1,3,5,10))
## POPULAR run fold/sample [model time/prediction time]
##   1  [0sec/0.01sec] 
##   2  [0sec/0.01sec] 
##   3  [0sec/0.02sec] 
##   4  [0sec/0.03sec]
ev
## Evaluation results for 4 folds/samples using method 'POPULAR'.
## look at the results (by the length of the topNList)
avg(ev)
##      TP   FP   FN     TN precision     recall        TPR         FPR
## 1  0.55 0.45 9.10 271.90      0.55 0.06232323 0.06232323 0.001657905
## 3  1.95 1.05 7.70 271.30      0.65 0.20719697 0.20719697 0.003855063
## 5  2.85 2.15 6.80 270.20      0.57 0.30837121 0.30837121 0.007902131
## 10 4.00 6.00 5.65 266.35      0.40 0.42537247 0.42537247 0.022029331
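
## avg() averages the counts over the 4 folds; getConfusionMatrix(ev) returns
## the per-fold results if you want to inspect the individual runs
## (a quick aside; output omitted here)
getConfusionMatrix(ev)
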
plot(ev, annotate = TRUE)

## now run evaluate with a list
algorithms <- list(
RANDOM = list(name = "RANDOM", param = NULL),
POPULAR = list(name = "POPULAR", param = NULL)
)
evlist <- evaluate(es, algorithms, n=c(1,3,5,10))
## RANDOM run fold/sample [model time/prediction time]
##   1  [0sec/0sec] 
##   2  [0sec/0.02sec] 
##   3  [0sec/0.01sec] 
##   4  [0sec/0.02sec] 
## POPULAR run fold/sample [model time/prediction time]
##   1  [0sec/0.02sec] 
##   2  [0sec/0.04sec] 
##   3  [0sec/0.02sec] 
##   4  [0sec/0.02sec]
plot(evlist, legend="topright")

## select the first results
evlist[[1]]
## Evaluation results for 4 folds/samples using method 'RANDOM'.
### Evaluate using a data set with real-valued ratings
data("Jester5k")
es <- evaluationScheme(Jester5k[1:25], method="cross-validation",
k=4, given=10, goodRating=5)
## Note: goodRating is used to determine positive ratings
## predict top-N recommendation lists
ev <- evaluate(es, "RANDOM", type="topNList", n=10)
## RANDOM run fold/sample [model time/prediction time]
##   1  [0sec/0.01sec] 
##   2  [0sec/0.01sec] 
##   3  [0sec/0sec] 
##   4  [0sec/0sec]
avg(ev)
##     TP  FP   FN   TN precision     recall        TPR       FPR
## 10 1.5 8.5 11.5 68.5      0.15 0.09171105 0.09171105 0.1086491
## predict missing ratings
ev <- evaluate(es, "RANDOM", type="ratings")
## RANDOM run fold/sample [model time/prediction time]
##   1  [0.02sec/0sec] 
##   2  [0sec/0sec] 
##   3  [0sec/0sec] 
##   4  [0sec/0sec]
avg(ev)
##         RMSE      MSE      MAE
## res 7.769319 60.53158 6.274149
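
Any other registered method can be evaluated with the same scheme; for example, swapping the RANDOM baseline for user-based CF (a minimal sketch, output omitted):

ev_ubcf <- evaluate(es, "UBCF", type="ratings")
avg(ev_ubcf)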

Creator Function for evaluationScheme

Creates an evaluationScheme object from a data set. The scheme can be a simple split into training and test data, k-fold cross-validation, or k independent bootstrap samples.

data("MSWeb")
MSWeb10 <- sample(MSWeb[rowCounts(MSWeb) >10,], 50)
MSWeb10
## 50 x 285 rating matrix of class 'binaryRatingMatrix' with 696 ratings.
## simple split with 3 items given
esSplit <- evaluationScheme(MSWeb10, method="split",
train = 0.9, k=1, given=3)
esSplit
## Evaluation scheme with 3 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.900
## Good ratings: NA
## Data set: 50 x 285 rating matrix of class 'binaryRatingMatrix' with 696 ratings.
## 4-fold cross-validation with 1 item given for learning.

esCross <- evaluationScheme(MSWeb10, method="cross-validation",
k=4, given=1)
esCross
## Evaluation scheme with 1 items given
## Method: 'cross-validation' with 4 run(s).
## Good ratings: NA
## Data set: 50 x 285 rating matrix of class 'binaryRatingMatrix' with 696 ratings.
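
The third option mentioned above, k independent bootstrap samples, is requested the same way; a minimal sketch reusing MSWeb10 (here k is the number of bootstrap runs, output omitted):

esBoot <- evaluationScheme(MSWeb10, method="bootstrap", k=4, given=3)
esBoot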

Funk SVD for Matrices with Missing Data

Implements matrix decomposition by the stochastic gradient descent optimization popularized by Simon Funk to minimize the error on the known values.
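
Before running funkSVD() on Jester5k, the idea can be illustrated with a hand-rolled toy version: for every known rating, nudge the corresponding user and item feature vectors along the negative gradient of the squared error. The sketch below is my own illustration, not the package implementation (funkSVD() trains the features one at a time, as the trace below shows, while this toy updates all k features together), and the toy matrix and learning parameters are arbitrary:

## toy SGD matrix factorization restricted to the known (non-NA) entries
funk_sgd_sketch <- function(R, k = 2, gamma = 0.02, lambda = 0.001, epochs = 200) {
  set.seed(42)
  U <- matrix(rnorm(nrow(R) * k, sd = 0.1), nrow(R), k)
  V <- matrix(rnorm(ncol(R) * k, sd = 0.1), ncol(R), k)
  obs <- which(!is.na(R), arr.ind = TRUE)        # indices of the known ratings
  for (e in seq_len(epochs)) {
    for (o in seq_len(nrow(obs))) {
      u <- obs[o, 1]; i <- obs[o, 2]
      err <- R[u, i] - sum(U[u, ] * V[i, ])      # error on this known value
      U[u, ] <- U[u, ] + gamma * (err * V[i, ] - lambda * U[u, ])
      V[i, ] <- V[i, ] + gamma * (err * U[u, ] - lambda * V[i, ])
    }
  }
  list(U = U, V = V)
}

R <- matrix(c(5, 3, NA, 1,
              4, NA, 1, 1,
              1, 1, NA, 5), nrow = 4)
fit <- funk_sgd_sketch(R)
round(tcrossprod(fit$U, fit$V), 2)   # reconstruction R ~ UV', missing entries now filled in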

### this takes a while to run
data("Jester5k")
train <- as(Jester5k[1:100], "matrix")
fsvd <- funkSVD(train, verbose = TRUE)
## 
## Training feature: 1 / 10 : .........................................................................
## ->  73 epochs - final improvement was 9.602368e-07 
## 
## Training feature: 2 / 10 : ....................................................................................................................
## ->  116 epochs - final improvement was 9.917546e-07 
## 
## Training feature: 3 / 10 : ........................................................................................................................................................................................................
## ->  200 epochs - final improvement was 1.248391e-05 
## 
## Training feature: 4 / 10 : ........................................................................................................................................................................................................
## ->  200 epochs - final improvement was 1.965783e-05 
## 
## Training feature: 5 / 10 : ........................................................................................................................................................................................................
## ->  200 epochs - final improvement was 6.33653e-05 
## 
## Training feature: 6 / 10 : ........................................................................................................................................................................................................
## ->  200 epochs - final improvement was 0.0001152599 
## 
## Training feature: 7 / 10 : ........................................................................................................................................................................................................
## ->  200 epochs - final improvement was 0.000157843 
## 
## Training feature: 8 / 10 : ........................................................................................................................................................................................................
## ->  200 epochs - final improvement was 5.163511e-05 
## 
## Training feature: 9 / 10 : ........................................................................................................................................................................................................
## ->  200 epochs - final improvement was 0.0002243435 
## 
## Training feature: 10 / 10 : ........................................................................................................................................................................................................
## ->  200 epochs - final improvement was 1.215702e-05
### reconstruct the rating matrix as R = UV'
### and calculate the root mean square error on the known ratings
r <- tcrossprod(fsvd$U, fsvd$V)
RMSE(train, r)
## [1] 3.201447
### fold in new users for matrix completion
test <- as(Jester5k[101:105], "matrix")
p <- predict(fsvd, test, verbose = TRUE)
## 
## Estimating user feature: 1 / 10 : ..................................................
## ->  50 epochs - final improvement was 1.932339e-08 
## 
## Estimating user feature: 2 / 10 : .......................................................................
## ->  71 epochs - final improvement was 9.330019e-07 
## 
## Estimating user feature: 3 / 10 : ............................................................................................
## ->  92 epochs - final improvement was 9.639379e-07 
## 
## Estimating user feature: 4 / 10 : .....................................................................................................................
## ->  117 epochs - final improvement was 9.563317e-07 
## 
## Estimating user feature: 5 / 10 : ........................................................................................................
## ->  104 epochs - final improvement was 9.502989e-07 
## 
## Estimating user feature: 6 / 10 : ...............................................................................
## ->  79 epochs - final improvement was 9.330617e-07 
## 
## Estimating user feature: 7 / 10 : ................................................................................................
## ->  96 epochs - final improvement was 9.953439e-07 
## 
## Estimating user feature: 8 / 10 : .......................................................................
## ->  71 epochs - final improvement was 9.516152e-07 
## 
## Estimating user feature: 9 / 10 : .........................................................................................
## ->  89 epochs - final improvement was 9.391141e-07 
## 
## Estimating user feature: 10 / 10 : ...............................................................................
## ->  79 epochs - final improvement was 9.599309e-07
RMSE(test, p)
## [1] 3.488927

List and Data.frame Representation for Recommender Matrix Objects

## Examples
data(Jester5k)
getList(Jester5k[1,])
## $u2841
##    j1    j2    j3    j4    j5    j6    j7    j8    j9   j10   j11   j12 
##  7.91  9.17  5.34  8.16 -8.74  7.14  8.88 -8.25  5.87  6.21  7.72  6.12 
##   j13   j14   j15   j16   j17   j18   j19   j20   j21   j22   j23   j24 
## -0.73  7.77 -5.83 -8.88  8.98 -9.32 -9.08 -9.13  7.77  8.59  5.29  8.25 
##   j25   j26   j27   j28   j29   j30   j31   j32   j33   j34   j35   j36 
##  6.02  5.24  7.82  7.96 -8.88  8.25  3.64 -0.73  8.25  5.34 -7.77 -9.76 
##   j37   j38   j39   j40   j41   j42   j43   j44   j45   j46   j47   j48 
##  7.04  5.78  8.06  7.23  8.45  9.08  6.75  5.87  8.45 -9.42  5.15  8.74 
##   j49   j50   j51   j52   j53   j54   j55   j56   j57   j58   j59   j60 
##  6.41  8.64  8.45  9.13 -8.79  6.17  8.25  6.89  5.73  5.73  8.20  6.46 
##   j61   j62   j63   j64   j65   j66   j67   j68   j69   j70   j77   j91 
##  8.64  3.59  7.28  8.25  4.81 -8.20  5.73  7.04  4.56  8.79 -9.71  7.57 
##   j92   j93   j94   j95   j96   j97   j98   j99  j100 
## -9.42 -9.27  7.62  7.77  8.20  6.60  7.33  9.17  8.88
getData.frame(Jester5k[1,])
##     user item rating
## 1  u2841   j1   7.91
## 2  u2841   j2   9.17
## 3  u2841   j3   5.34
## 4  u2841   j4   8.16
## 5  u2841   j5  -8.74
## 6  u2841   j6   7.14
## 7  u2841   j7   8.88
## 8  u2841   j8  -8.25
## 9  u2841   j9   5.87
## 10 u2841  j10   6.21
## 11 u2841  j11   7.72
## 12 u2841  j12   6.12
## 13 u2841  j13  -0.73
## 14 u2841  j14   7.77
## 15 u2841  j15  -5.83
## 16 u2841  j16  -8.88
## 17 u2841  j17   8.98
## 18 u2841  j18  -9.32
## 19 u2841  j19  -9.08
## 20 u2841  j20  -9.13
## 21 u2841  j21   7.77
## 22 u2841  j22   8.59
## 23 u2841  j23   5.29
## 24 u2841  j24   8.25
## 25 u2841  j25   6.02
## 26 u2841  j26   5.24
## 27 u2841  j27   7.82
## 28 u2841  j28   7.96
## 29 u2841  j29  -8.88
## 30 u2841  j30   8.25
## 31 u2841  j31   3.64
## 32 u2841  j32  -0.73
## 33 u2841  j33   8.25
## 34 u2841  j34   5.34
## 35 u2841  j35  -7.77
## 36 u2841  j36  -9.76
## 37 u2841  j37   7.04
## 38 u2841  j38   5.78
## 39 u2841  j39   8.06
## 40 u2841  j40   7.23
## 41 u2841  j41   8.45
## 42 u2841  j42   9.08
## 43 u2841  j43   6.75
## 44 u2841  j44   5.87
## 45 u2841  j45   8.45
## 46 u2841  j46  -9.42
## 47 u2841  j47   5.15
## 48 u2841  j48   8.74
## 49 u2841  j49   6.41
## 50 u2841  j50   8.64
## 51 u2841  j51   8.45
## 52 u2841  j52   9.13
## 53 u2841  j53  -8.79
## 54 u2841  j54   6.17
## 55 u2841  j55   8.25
## 56 u2841  j56   6.89
## 57 u2841  j57   5.73
## 58 u2841  j58   5.73
## 59 u2841  j59   8.20
## 60 u2841  j60   6.46
## 61 u2841  j61   8.64
## 62 u2841  j62   3.59
## 63 u2841  j63   7.28
## 64 u2841  j64   8.25
## 65 u2841  j65   4.81
## 66 u2841  j66  -8.20
## 67 u2841  j67   5.73
## 68 u2841  j68   7.04
## 69 u2841  j69   4.56
## 70 u2841  j70   8.79
## 71 u2841  j77  -9.71
## 72 u2841  j91   7.57
## 73 u2841  j92  -9.42
## 74 u2841  j93  -9.27
## 75 u2841  j94   7.62
## 76 u2841  j95   7.77
## 77 u2841  j96   8.20
## 78 u2841  j97   6.60
## 79 u2841  j98   7.33
## 80 u2841  j99   9.17
## 81 u2841 j100   8.88
data(Jester5k)
Jester5k
## 5000 x 100 rating matrix of class 'realRatingMatrix' with 362106 ratings.
## number of ratings
nratings(Jester5k)
## [1] 362106
## number of ratings per user
summary(rowCounts(Jester5k))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   36.00   53.00   72.00   72.42  100.00  100.00
## rating distribution
hist(getRatings(Jester5k), main="Distribution of ratings")

## 'best' joke with highest average rating
best <- which.max(colMeans(Jester5k))
cat(JesterJokes[best])
## A guy goes into confession and says to the priest, "Father, I'm 80 years old, widower, with 11 grandchildren. Last night I met two beautiful flight attendants. They took me home and I made love to both of them. Twice." The priest said: "Well, my son, when was the last time you were in confession?" "Never Father, I'm Jewish." "So then, why are you telling me?" "I'm telling everybody."

Create a Recommender Model

Recommender uses the registry mechanism from the registry package to manage recommendation methods. This lets the user easily specify and add new methods. The registry is called recommenderRegistry; see the examples below.

data("MSWeb")
MSWeb10 <- sample(MSWeb[rowCounts(MSWeb) >10,], 100)
rec <- Recommender(MSWeb10, method = "POPULAR")
rec
## Recommender of type 'POPULAR' for 'binaryRatingMatrix' 
## learned using 100 users.
str(getModel(rec))
## List of 1
##  $ topN:Formal class 'topNList' [package "recommenderlab"] with 4 slots
##   .. ..@ items     :List of 1
##   .. .. ..$ : int [1:285] 9 19 5 18 2 10 35 27 4 42 ...
##   .. ..@ ratings   : NULL
##   .. ..@ itemLabels: chr [1:285] "regwiz" "Support Desktop" "End User Produced View" "Knowledge Base" ...
##   .. ..@ n         : int 285
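
## the fitted model is used through predict(); e.g. top-5 recommendations
## for the first two users in MSWeb10 (a minimal sketch, output omitted)
pre <- predict(rec, MSWeb10[1:2], n = 5)
as(pre, "list")
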
## look at registry and a few methods
recommenderRegistry$get_entry_names()
##  [1] "AR_binaryRatingMatrix"      "IBCF_binaryRatingMatrix"   
##  [3] "IBCF_realRatingMatrix"      "POPULAR_binaryRatingMatrix"
##  [5] "POPULAR_realRatingMatrix"   "RANDOM_realRatingMatrix"   
##  [7] "RANDOM_binaryRatingMatrix"  "SVD_realRatingMatrix"      
##  [9] "SVDF_realRatingMatrix"      "UBCF_binaryRatingMatrix"   
## [11] "UBCF_realRatingMatrix"
recommenderRegistry$get_entry("POPULAR", dataType = "binaryRatingMatrix")
## Recommender method: POPULAR
## Description: Recommender based on item popularity (binary data).
## Parameters: None
recommenderRegistry$get_entry("SVD", dataType = "realRatingMatrix")
## Recommender method: SVD
## Description: Recommender based on SVD approximation with column-mean imputation (real data).
## Parameters:
##    k maxiter normalize minRating
## 1 10     100    center        NA
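
The parameters listed in a registry entry can be overridden through the parameter argument of Recommender(); a minimal sketch choosing a different rank k for the SVD method shown above (output omitted):

data(Jester5k)
rec_svd <- Recommender(Jester5k[1:200], method = "SVD", parameter = list(k = 20))
rec_svd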