library("recommenderlab")
In some cases customers rate items simply as “Good” or “Not Good”, or “Like” or “Not Like”. In these cases we deal with binary data. Below I create a 5 x 10 binary matrix.
## create a 0-1 matrix
m <- matrix(sample(c(0,1), 50, replace=TRUE), nrow=5, ncol=10,
  dimnames=list(users=paste("u", 1:5, sep=''),
                items=paste("i", 1:10, sep='')))
m
## items
## users i1 i2 i3 i4 i5 i6 i7 i8 i9 i10
## u1 1 1 0 1 1 1 0 1 1 1
## u2 0 0 0 1 0 0 0 0 0 0
## u3 0 1 0 0 1 1 1 0 1 1
## u4 0 0 1 1 1 0 0 1 0 1
## u5 1 1 0 0 0 1 0 0 0 0
## coerce it into a binaryRatingMatrix
b <- as(m, "binaryRatingMatrix")
b
## 5 x 10 rating matrix of class 'binaryRatingMatrix' with 23 ratings.
## coerce it back to see if it worked
as(b, "matrix")
## i1 i2 i3 i4 i5 i6 i7 i8 i9 i10
## u1 TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
## u2 FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## u3 FALSE TRUE FALSE FALSE TRUE TRUE TRUE FALSE TRUE TRUE
## u4 FALSE FALSE TRUE TRUE TRUE FALSE FALSE TRUE FALSE TRUE
## u5 TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## use some methods defined in ratingMatrix
dim(b)
## [1] 5 10
dimnames(b)
## [[1]]
## [1] "u1" "u2" "u3" "u4" "u5"
##
## [[2]]
## [1] "i1" "i2" "i3" "i4" "i5" "i6" "i7" "i8" "i9" "i10"
## counts of Rows and Columns
rowCounts(b)
## u1 u2 u3 u4 u5
## 8 1 6 5 3
colCounts(b)
## i1 i2 i3 i4 i5 i6 i7 i8 i9 i10
## 2 3 1 3 3 3 1 2 2 3
## Plot - items rated with 1 by a user are shown in grey
image(b)
## sample and subset
sample(b,2)
## 2 x 10 rating matrix of class 'binaryRatingMatrix' with 8 ratings.
b[1:2,1:5]
## 2 x 5 rating matrix of class 'binaryRatingMatrix' with 5 ratings.
## coercion
as(b, "list")
## $u1
## [1] "i1" "i2" "i4" "i5" "i6" "i8" "i9" "i10"
##
## $u2
## [1] "i4"
##
## $u3
## [1] "i2" "i5" "i6" "i7" "i9" "i10"
##
## $u4
## [1] "i3" "i4" "i5" "i8" "i10"
##
## $u5
## [1] "i1" "i2" "i6"
head(as(b, "data.frame"))
## user item rating
## 1 u1 i1 1
## 3 u1 i2 1
## 7 u1 i4 1
## 10 u1 i5 1
## 13 u1 i6 1
## 17 u1 i8 1
head(getData.frame(b, ratings=FALSE))
## user item
## 1 u1 i1
## 3 u1 i2
## 7 u1 i4
## 10 u1 i5
## 13 u1 i6
## 17 u1 i8
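Real-valued ratings can also be collapsed into this binary form with binarize(). A minimal sketch (the threshold of 5 is an arbitrary illustrative choice; the Jester5k data used here is described in the next section):
data(Jester5k)
b5 <- binarize(Jester5k, minRating=5)  ## ratings of 5 or higher become 1, everything else 0
b5
rowCounts(b5)[1:5]                     ## number of "liked" jokes for the first 5 users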
Calculate prediction accuracy. For predicted ratings MAE (mean absolute error), MSE (mean squared error) and RMSE (root mean squared error) are calculated. For topNLists various binary classification metrics are returned.
Below I give an example using the Jester5k data set, which contains a 5000 x 100 rating matrix (5000 users and 100 jokes) with ratings between -10.00 and +10.00. All selected users have rated 36 or more jokes; clearly there are still many NA values.
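The sparsity is easy to verify from nratings() and the matrix dimensions; a quick sketch:
data(Jester5k)
1 - nratings(Jester5k)/prod(dim(Jester5k))  ## about 0.28, i.e. roughly 28% of the entries are missing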
### real valued recommender
data(Jester5k)
## create a 90/10 train/test split for the first 500 users in Jester5k (15 items given per test user)
e <- evaluationScheme(Jester5k[1:500,], method="split", train=0.9,
  k=1, given=15)
e
## Evaluation scheme with 15 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.900
## Good ratings: NA
## Data set: 500 x 100 rating matrix of class 'realRatingMatrix' with 36345 ratings.
## create a user-based CF recommender using training data
r <- Recommender(getData(e, "train"), "UBCF")
## create predictions for the test data using known ratings (see given above)
p <- predict(r, getData(e, "known"), type="ratings")
p
## 50 x 100 rating matrix of class 'realRatingMatrix' with 4250 ratings.
## compute error metrics averaged per user and then averaged over all
## recommendations
calcPredictionAccuracy(p, getData(e, "unknown"))
## RMSE MSE MAE
## 4.409095 19.440119 3.459598
head(calcPredictionAccuracy(p, getData(e, "unknown"), byUser=TRUE))
## RMSE MSE MAE
## u5021 2.520403 6.352429 2.027328
## u20139 5.981527 35.778670 5.733758
## u21555 5.125833 26.274169 4.132931
## u1044 2.074061 4.301729 1.551694
## u7905 5.160507 26.630829 4.464032
## u1287 3.951751 15.616338 3.330400
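As a rough sanity check, the RMSE above can be approximated by hand by comparing the predicted ratings with the withheld ones. This is a minimal sketch that pools all held-out ratings, so it may differ slightly from the per-user averaging described in the comment above:
pm <- as(p, "matrix")                      ## predicted ratings (NA where no prediction was made)
um <- as(getData(e, "unknown"), "matrix")  ## withheld test ratings (NA elsewhere)
sqrt(mean((pm - um)^2, na.rm=TRUE))        ## pooled RMSE over all held-out ratings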
## evaluate topNLists instead (you need to specify given and goodRating!)
p <- predict(r, getData(e, "known"), type="topNList")
p
## Recommendations as 'topNList' with n = 10 for 50 users.
calcPredictionAccuracy(p, getData(e, "unknown"), given=15, goodRating=5)
## TP FP FN TN precision recall
## 4.30000000 5.70000000 13.14000000 61.86000000 0.43000000 0.30410138
## TPR FPR
## 0.30410138 0.07972832
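Note how the averaged counts relate to the metrics: precision can be recovered directly, because every test user gets exactly n = 10 recommendations, so the average precision is simply the average TP divided by 10. Recall/TPR and FPR appear to be averaged per user first (the pooled ratio 4.3/(4.3 + 13.14) ≈ 0.25 would not match the reported 0.304), so they cannot be reproduced from the averaged counts alone.
4.3/(4.3 + 5.7)  ## precision = TP/(TP + FP) = 0.43, matching the output above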
## evaluate a binary recommender
data(MSWeb)
MSWeb10 <- sample(MSWeb[rowCounts(MSWeb) >10,], 50)
e <- evaluationScheme(MSWeb10, method="split", train=0.9,
  k=1, given=3)
e
## Evaluation scheme with 3 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.900
## Good ratings: NA
## Data set: 50 x 285 rating matrix of class 'binaryRatingMatrix' with 656 ratings.
## create a user-based CF recommender using training data
r <- Recommender(getData(e, "train"), "UBCF")
## create predictions for the test data using known ratings (see given above)
p <- predict(r, getData(e, "known"), type="topNList", n=10)
p
## Recommendations as 'topNList' with n = 10 for 5 users.
calcPredictionAccuracy(p, getData(e, "unknown"), given=3)
## TP FP FN TN precision
## 4.40000000 5.60000000 5.80000000 266.20000000 0.44000000
## recall TPR FPR
## 0.43787879 0.43787879 0.02060226
Similarities are computed from dissimilarities using s = 1/(1 + d) or s = 1 - d, depending on the measure.
data(MSWeb)
## between 5 users
dissimilarity(MSWeb[1:5,], method = "jaccard")
## 1 2 3 4
## 2 0.7500000
## 3 0.8000000 0.3333333
## 4 1.0000000 1.0000000 1.0000000
## 5 1.0000000 1.0000000 1.0000000 1.0000000
similarity(MSWeb[1:5,], method = "jaccard")
## 1 2 3 4
## 2 0.2500000
## 3 0.2000000 0.6666667
## 4 0.0000000 0.0000000 0.0000000
## 5 0.0000000 0.0000000 0.0000000 0.0000000
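For the Jaccard measure the two results above are related by s = 1 - d. A quick check (a sketch, assuming both objects can be coerced with as.matrix()):
d <- as.matrix(dissimilarity(MSWeb[1:5,], method = "jaccard"))
s <- as.matrix(similarity(MSWeb[1:5,], method = "jaccard"))
all.equal(s[lower.tri(s)], 1 - d[lower.tri(d)])  ## should be TRUE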
## between first 3 items
dissimilarity(MSWeb[,1:3], method = "jaccard", which = "items")
## regwiz Support Desktop
## Support Desktop 0.9407466
## End User Produced View 0.9753239 0.9533011
similarity(MSWeb[,1:3], method = "jaccard", which = "items")
## regwiz Support Desktop
## Support Desktop 0.05925341
## End User Produced View 0.02467613 0.04669887
## cross-similarity between first 2 users and users 10-20
similarity(MSWeb[1:2,], MSWeb[10:20,], method="jaccard")
## An object of class "ar_cross_dissimilarity"
## 10 11 12 13 14 15 16 17 18 19 20
## 1 0.125 0 0 0 0 0 0 0 0 0.07692308 0.4
## 2 0.000 0 0 0 0 0 0 0 0 0.08333333 0.2
## Slot "method":
## [1] "jaccard"
Calculate the mean absolute error (MAE), the mean squared error (MSE), the root mean squared error (RMSE) and, for matrices, also the Frobenius norm (identical to RMSE).
true <- rnorm(10)
predicted <- rnorm(10)
MAE(true, predicted)
## [1] 1.191958
MSE(true, predicted)
## [1] 1.984111
RMSE(true, predicted)
## [1] 1.408585
true <- matrix(rnorm(9), nrow = 3)
predicted <- matrix(rnorm(9), nrow = 3)
frobenius(true, predicted)
## [1] 0.9994234
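These measures are related in the obvious ways, and the description above states that frobenius() is identical to RMSE for matrices. A quick sketch to check both claims on the matrices just created:
all.equal(RMSE(true, predicted), sqrt(MSE(true, predicted)))  ## RMSE is the square root of MSE
all.equal(frobenius(true, predicted), RMSE(true, predicted))  ## identical to RMSE for matrices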
### evaluate top-N list recommendations on a 0-1 data set
data("MSWeb")
MSWeb10 <- sample(MSWeb[rowCounts(MSWeb) >10,], 20)
## create an evaluation scheme
es <- evaluationScheme(MSWeb10, method="cross-validation",
  k=4, given=3)
## run evaluation
ev <- evaluate(es, "POPULAR", n=c(1,3,5,10))
## POPULAR run fold/sample [model time/prediction time]
## 1 [0sec/0.01sec]
## 2 [0sec/0.01sec]
## 3 [0sec/0.02sec]
## 4 [0sec/0.03sec]
ev
## Evaluation results for 4 folds/samples using method 'POPULAR'.
## look at the results (by the length of the topNList)
avg(ev)
## TP FP FN TN precision recall TPR FPR
## 1 0.55 0.45 9.10 271.90 0.55 0.06232323 0.06232323 0.001657905
## 3 1.95 1.05 7.70 271.30 0.65 0.20719697 0.20719697 0.003855063
## 5 2.85 2.15 6.80 270.20 0.57 0.30837121 0.30837121 0.007902131
## 10 4.00 6.00 5.65 266.35 0.40 0.42537247 0.42537247 0.022029331
plot(ev, annotate = TRUE)
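The plot method can also draw a precision/recall curve instead of the ROC-style curve above; a minimal sketch using the "prec/rec" plot type:
plot(ev, "prec/rec", annotate = TRUE)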
## now run evaluate with a list
algorithms <- list(
  RANDOM = list(name = "RANDOM", param = NULL),
  POPULAR = list(name = "POPULAR", param = NULL)
)
evlist <- evaluate(es, algorithms, n=c(1,3,5,10))
## RANDOM run fold/sample [model time/prediction time]
## 1 [0sec/0sec]
## 2 [0sec/0.02sec]
## 3 [0sec/0.01sec]
## 4 [0sec/0.02sec]
## POPULAR run fold/sample [model time/prediction time]
## 1 [0sec/0.02sec]
## 2 [0sec/0.04sec]
## 3 [0sec/0.02sec]
## 4 [0sec/0.02sec]
plot(evlist, legend="topright")
## select the first results
evlist[[1]]
## Evaluation results for 4 folds/samples using method 'RANDOM'.
### Evaluate using a data set with real-valued ratings
data("Jester5k")
es <- evaluationScheme(Jester5k[1:25], method="cross-validation",
  k=4, given=10, goodRating=5)
## Note: goodRating is used to determine positive ratings
## predict top-N recommendation lists
ev <- evaluate(es, "RANDOM", type="topNList", n=10)
## RANDOM run fold/sample [model time/prediction time]
## 1 [0sec/0.01sec]
## 2 [0sec/0.01sec]
## 3 [0sec/0sec]
## 4 [0sec/0sec]
avg(ev)
## TP FP FN TN precision recall TPR FPR
## 10 1.5 8.5 11.5 68.5 0.15 0.09171105 0.09171105 0.1086491
## predict missing ratings
ev <- evaluate(es, "RANDOM", type="ratings")
## RANDOM run fold/sample [model time/prediction time]
## 1 [0.02sec/0sec]
## 2 [0sec/0sec]
## 3 [0sec/0sec]
## 4 [0sec/0sec]
avg(ev)
## RMSE MSE MAE
## res 7.769319 60.53158 6.274149
Creates an evaluationScheme object from a data set. The scheme can be a simple split into training and test data, k-fold cross-validation, or k independent bootstrap samples.
data("MSWeb")
MSWeb10 <- sample(MSWeb[rowCounts(MSWeb) >10,], 50)
MSWeb10
## 50 x 285 rating matrix of class 'binaryRatingMatrix' with 696 ratings.
## simple split with 3 items given
esSplit <- evaluationScheme(MSWeb10, method="split",
  train=0.9, k=1, given=3)
esSplit
## Evaluation scheme with 3 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.900
## Good ratings: NA
## Data set: 50 x 285 rating matrix of class 'binaryRatingMatrix' with 696 ratings.
## 4-fold cross-validation with 1 item given per test user
esCross <- evaluationScheme(MSWeb10, method="cross-validation",
  k=4, given=1)
esCross
## Evaluation scheme with 1 items given
## Method: 'cross-validation' with 4 run(s).
## Good ratings: NA
## Data set: 50 x 285 rating matrix of class 'binaryRatingMatrix' with 696 ratings.
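The third option mentioned above, bootstrap sampling, is set up the same way; a minimal sketch (assuming method = "bootstrap" draws k bootstrap samples of the users for training):
esBoot <- evaluationScheme(MSWeb10, method="bootstrap", train=0.9,
  k=4, given=3)
esBoot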
Implements matrix decomposition by stochastic gradient descent optimization, popularized by Simon Funk, to minimize the error on the known values.
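The core of that optimization is a simple per-rating update of the user and item feature vectors. A minimal sketch of one such step (gamma is the learning rate and lambda the regularization; the values are only illustrative, and funkSVD() below handles the full loop over features and epochs):
sgd_step <- function(u_i, v_j, r_ij, gamma=0.015, lambda=0.001) {
  e <- r_ij - sum(u_i * v_j)                       ## error on this known rating
  u_new <- u_i + gamma * (e * v_j - lambda * u_i)  ## nudge the user features
  v_new <- v_j + gamma * (e * u_i - lambda * v_j)  ## nudge the item features
  list(u=u_new, v=v_new)
}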
### this takes a while to run
## Not run:
data("Jester5k")
train <- as(Jester5k[1:100], "matrix")
fsvd <- funkSVD(train, verbose = TRUE)
##
## Training feature: 1 / 10 : .........................................................................
## -> 73 epochs - final improvement was 9.602368e-07
##
## Training feature: 2 / 10 : ....................................................................................................................
## -> 116 epochs - final improvement was 9.917546e-07
##
## Training feature: 3 / 10 : ........................................................................................................................................................................................................
## -> 200 epochs - final improvement was 1.248391e-05
##
## Training feature: 4 / 10 : ........................................................................................................................................................................................................
## -> 200 epochs - final improvement was 1.965783e-05
##
## Training feature: 5 / 10 : ........................................................................................................................................................................................................
## -> 200 epochs - final improvement was 6.33653e-05
##
## Training feature: 6 / 10 : ........................................................................................................................................................................................................
## -> 200 epochs - final improvement was 0.0001152599
##
## Training feature: 7 / 10 : ........................................................................................................................................................................................................
## -> 200 epochs - final improvement was 0.000157843
##
## Training feature: 8 / 10 : ........................................................................................................................................................................................................
## -> 200 epochs - final improvement was 5.163511e-05
##
## Training feature: 9 / 10 : ........................................................................................................................................................................................................
## -> 200 epochs - final improvement was 0.0002243435
##
## Training feature: 10 / 10 : ........................................................................................................................................................................................................
## -> 200 epochs - final improvement was 1.215702e-05
### reconstruct the rating matrix as R = UV'
### and calculate the root mean square error on the known ratings
r <- tcrossprod(fsvd$U, fsvd$V)
RMSE(train, r)
## [1] 3.201447
### fold in new users for matrix completion
test <- as(Jester5k[101:105], "matrix")
p <- predict(fsvd, test, verbose = TRUE)
##
## Estimating user feature: 1 / 10 : ..................................................
## -> 50 epochs - final improvement was 1.932339e-08
##
## Estimating user feature: 2 / 10 : .......................................................................
## -> 71 epochs - final improvement was 9.330019e-07
##
## Estimating user feature: 3 / 10 : ............................................................................................
## -> 92 epochs - final improvement was 9.639379e-07
##
## Estimating user feature: 4 / 10 : .....................................................................................................................
## -> 117 epochs - final improvement was 9.563317e-07
##
## Estimating user feature: 5 / 10 : ........................................................................................................
## -> 104 epochs - final improvement was 9.502989e-07
##
## Estimating user feature: 6 / 10 : ...............................................................................
## -> 79 epochs - final improvement was 9.330617e-07
##
## Estimating user feature: 7 / 10 : ................................................................................................
## -> 96 epochs - final improvement was 9.953439e-07
##
## Estimating user feature: 8 / 10 : .......................................................................
## -> 71 epochs - final improvement was 9.516152e-07
##
## Estimating user feature: 9 / 10 : .........................................................................................
## -> 89 epochs - final improvement was 9.391141e-07
##
## Estimating user feature: 10 / 10 : ...............................................................................
## -> 79 epochs - final improvement was 9.599309e-07
RMSE(test, p)
## [1] 3.488927
Some further examples: extract the stored ratings with getList() and getData.frame(), and explore the Jester5k data.
data(Jester5k)
getList(Jester5k[1,])
## $u2841
## j1 j2 j3 j4 j5 j6 j7 j8 j9 j10 j11 j12
## 7.91 9.17 5.34 8.16 -8.74 7.14 8.88 -8.25 5.87 6.21 7.72 6.12
## j13 j14 j15 j16 j17 j18 j19 j20 j21 j22 j23 j24
## -0.73 7.77 -5.83 -8.88 8.98 -9.32 -9.08 -9.13 7.77 8.59 5.29 8.25
## j25 j26 j27 j28 j29 j30 j31 j32 j33 j34 j35 j36
## 6.02 5.24 7.82 7.96 -8.88 8.25 3.64 -0.73 8.25 5.34 -7.77 -9.76
## j37 j38 j39 j40 j41 j42 j43 j44 j45 j46 j47 j48
## 7.04 5.78 8.06 7.23 8.45 9.08 6.75 5.87 8.45 -9.42 5.15 8.74
## j49 j50 j51 j52 j53 j54 j55 j56 j57 j58 j59 j60
## 6.41 8.64 8.45 9.13 -8.79 6.17 8.25 6.89 5.73 5.73 8.20 6.46
## j61 j62 j63 j64 j65 j66 j67 j68 j69 j70 j77 j91
## 8.64 3.59 7.28 8.25 4.81 -8.20 5.73 7.04 4.56 8.79 -9.71 7.57
## j92 j93 j94 j95 j96 j97 j98 j99 j100
## -9.42 -9.27 7.62 7.77 8.20 6.60 7.33 9.17 8.88
getData.frame(Jester5k[1,])
## user item rating
## 1 u2841 j1 7.91
## 2 u2841 j2 9.17
## 3 u2841 j3 5.34
## 4 u2841 j4 8.16
## 5 u2841 j5 -8.74
## 6 u2841 j6 7.14
## 7 u2841 j7 8.88
## 8 u2841 j8 -8.25
## 9 u2841 j9 5.87
## 10 u2841 j10 6.21
## 11 u2841 j11 7.72
## 12 u2841 j12 6.12
## 13 u2841 j13 -0.73
## 14 u2841 j14 7.77
## 15 u2841 j15 -5.83
## 16 u2841 j16 -8.88
## 17 u2841 j17 8.98
## 18 u2841 j18 -9.32
## 19 u2841 j19 -9.08
## 20 u2841 j20 -9.13
## 21 u2841 j21 7.77
## 22 u2841 j22 8.59
## 23 u2841 j23 5.29
## 24 u2841 j24 8.25
## 25 u2841 j25 6.02
## 26 u2841 j26 5.24
## 27 u2841 j27 7.82
## 28 u2841 j28 7.96
## 29 u2841 j29 -8.88
## 30 u2841 j30 8.25
## 31 u2841 j31 3.64
## 32 u2841 j32 -0.73
## 33 u2841 j33 8.25
## 34 u2841 j34 5.34
## 35 u2841 j35 -7.77
## 36 u2841 j36 -9.76
## 37 u2841 j37 7.04
## 38 u2841 j38 5.78
## 39 u2841 j39 8.06
## 40 u2841 j40 7.23
## 41 u2841 j41 8.45
## 42 u2841 j42 9.08
## 43 u2841 j43 6.75
## 44 u2841 j44 5.87
## 45 u2841 j45 8.45
## 46 u2841 j46 -9.42
## 47 u2841 j47 5.15
## 48 u2841 j48 8.74
## 49 u2841 j49 6.41
## 50 u2841 j50 8.64
## 51 u2841 j51 8.45
## 52 u2841 j52 9.13
## 53 u2841 j53 -8.79
## 54 u2841 j54 6.17
## 55 u2841 j55 8.25
## 56 u2841 j56 6.89
## 57 u2841 j57 5.73
## 58 u2841 j58 5.73
## 59 u2841 j59 8.20
## 60 u2841 j60 6.46
## 61 u2841 j61 8.64
## 62 u2841 j62 3.59
## 63 u2841 j63 7.28
## 64 u2841 j64 8.25
## 65 u2841 j65 4.81
## 66 u2841 j66 -8.20
## 67 u2841 j67 5.73
## 68 u2841 j68 7.04
## 69 u2841 j69 4.56
## 70 u2841 j70 8.79
## 71 u2841 j77 -9.71
## 72 u2841 j91 7.57
## 73 u2841 j92 -9.42
## 74 u2841 j93 -9.27
## 75 u2841 j94 7.62
## 76 u2841 j95 7.77
## 77 u2841 j96 8.20
## 78 u2841 j97 6.60
## 79 u2841 j98 7.33
## 80 u2841 j99 9.17
## 81 u2841 j100 8.88
data(Jester5k)
Jester5k
## 5000 x 100 rating matrix of class 'realRatingMatrix' with 362106 ratings.
## number of ratings
nratings(Jester5k)
## [1] 362106
## number of ratings per user
summary(rowCounts(Jester5k))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 36.00 53.00 72.00 72.42 100.00 100.00
## rating distribution
hist(getRatings(Jester5k), main="Distribution of ratings")
## 'best' joke with highest average rating
best <- which.max(colMeans(Jester5k))
cat(JesterJokes[best])
## A guy goes into confession and says to the priest, "Father, I'm 80 years old, widower, with 11 grandchildren. Last night I met two beautiful flight attendants. They took me home and I made love to both of them. Twice." The priest said: "Well, my son, when was the last time you were in confession?" "Never Father, I'm Jewish." "So then, why are you telling me?" "I'm telling everybody."
Recommender uses the registry mechanism from package registry to manage methods. This lets the user easily specify and add new methods. The registry is called recommenderRegistry. See the examples below.
data("MSWeb")
MSWeb10 <- sample(MSWeb[rowCounts(MSWeb) >10,], 100)
rec <- Recommender(MSWeb10, method = "POPULAR")
rec
## Recommender of type 'POPULAR' for 'binaryRatingMatrix'
## learned using 100 users.
str(getModel(rec))
## List of 1
## $ topN:Formal class 'topNList' [package "recommenderlab"] with 4 slots
## .. ..@ items :List of 1
## .. .. ..$ : int [1:285] 9 19 5 18 2 10 35 27 4 42 ...
## .. ..@ ratings : NULL
## .. ..@ itemLabels: chr [1:285] "regwiz" "Support Desktop" "End User Produced View" "Knowledge Base" ...
## .. ..@ n : int 285
## look at registry and a few methods
recommenderRegistry$get_entry_names()
## [1] "AR_binaryRatingMatrix" "IBCF_binaryRatingMatrix"
## [3] "IBCF_realRatingMatrix" "POPULAR_binaryRatingMatrix"
## [5] "POPULAR_realRatingMatrix" "RANDOM_realRatingMatrix"
## [7] "RANDOM_binaryRatingMatrix" "SVD_realRatingMatrix"
## [9] "SVDF_realRatingMatrix" "UBCF_binaryRatingMatrix"
## [11] "UBCF_realRatingMatrix"
recommenderRegistry$get_entry("POPULAR", dataType = "binaryRatingMatrix")
## Recommender method: POPULAR
## Description: Recommender based on item popularity (binary data).
## Parameters: None
recommenderRegistry$get_entry("SVD", dataType = "realRatingMatrix")
## Recommender method: SVD
## Description: Recommender based on SVD approximation with column-mean imputation (real data).
## Parameters:
## k maxiter normalize minRating
## 1 10 100 center NA
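The registry also lets you register your own method. The sketch below adds a toy recommender under the hypothetical name "GLOBALMEAN" that always predicts the global mean rating; the creator-function contract (return a Recommender object holding a model and a predict function) follows the pattern used by the built-in methods, so treat the details as an assumption rather than the definitive API:
GLOBALMEAN <- function(data, parameter=NULL) {
  model <- list(mu = mean(getRatings(data)))  ## global mean of all known ratings
  predict <- function(model, newdata, n=10, type=c("topNList", "ratings"), ...) {
    type <- match.arg(type)
    ratings <- matrix(model$mu, nrow=nrow(newdata), ncol=ncol(newdata),
      dimnames=dimnames(newdata))
    ratings <- as(ratings, "realRatingMatrix")
    if(type == "ratings") return(ratings)
    getTopNLists(ratings, n=n)  ## a real method would also drop items the user already rated
  }
  new("Recommender", method="GLOBALMEAN", dataType=class(data),
    ntrain=nrow(data), model=model, predict=predict)
}
recommenderRegistry$set_entry(method="GLOBALMEAN", dataType="realRatingMatrix",
  fun=GLOBALMEAN, description="Toy recommender: always predicts the global mean rating.")
After registering, Recommender(Jester5k, method = "GLOBALMEAN") should pick the new method up from the registry.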