Collaborative filtering (CF) uses given rating data by many users for many items as the basis for predicting missing ratings and/or for creating a top-N recommendation list for a given user, called the active user.
In this project, I’m going to create two recommender algorithms, a user-based collaborative filtering recommender algorithm, and an item-based collaborative filtering recommender algorithm. We’ll use these to predict a user’s rating for a given joke.
The data I’ll be using is the Jester Dataset created by Ken Goldberg at UC Berkeley. He set up a site where you’re prompted with a joke, and then you rate how funny you thought that joke was on a sliding, continuous scale of -10 to 10. After rating just a few jokes, Ken’s site then presents you with a list of jokes the model predicts you’ll find to be funny. The data we’ll be working with contains ratings by joke by user. Some values are blank, since not every user rated every joke.
Data Import, Exploration, Transformation
#load library
library("recommenderlab")
## Warning: package 'recommenderlab' was built under R version 3.3.3
## Loading required package: Matrix
## Loading required package: arules
## Warning: package 'arules' was built under R version 3.3.3
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
## Loading required package: proxy
## Warning: package 'proxy' was built under R version 3.3.3
##
## Attaching package: 'proxy'
## The following object is masked from 'package:Matrix':
##
## as.matrix
## The following objects are masked from 'package:stats':
##
## as.dist, dist
## The following object is masked from 'package:base':
##
## as.matrix
## Loading required package: registry
#load a small subset for manual calculations
jest_data <- read.csv("https://raw.githubusercontent.com/excelsiordata/DATA643/master/jester-subset.csv", )
#take a look
head(jest_data)
## userid j1 j2 j3 j4 j5 j6 j7 j8 j9 j10
## 1 u1 1 5 1 1 5 5 2 2 2 5
## 2 u2 NA 4 4 NA 4 1 5 1 5 4
## 3 u3 5 2 NA 1 3 3 4 4 NA 3
## 4 u4 2 NA 2 2 NA 2 NA 1 1 4
## 5 u5 NA 5 NA NA 4 4 1 2 1 3
## 6 u6 5 3 5 2 NA NA 1 NA 4 NA
#convert NAs to 0
jest_data[is.na(jest_data)] <- 0
#take a look again and make sure that worked
head(jest_data)
## userid j1 j2 j3 j4 j5 j6 j7 j8 j9 j10
## 1 u1 1 5 1 1 5 5 2 2 2 5
## 2 u2 0 4 4 0 4 1 5 1 5 4
## 3 u3 5 2 0 1 3 3 4 4 0 3
## 4 u4 2 0 2 2 0 2 0 1 1 4
## 5 u5 0 5 0 0 4 4 1 2 1 3
## 6 u6 5 3 5 2 0 0 1 0 4 0
j_d_m <- as(jest_data,"realRatingMatrix")
j_d_m_n <- normalize(j_d_m)
image(j_d_m, main = "Raw Subset")

image(j_d_m_n, main = "Normalized Subset")

##########################Begin work on the full data set##########################
#connect to the data
data(Jester5k)
#Create the user-item matrix
jest_matrix <- as(Jester5k,"realRatingMatrix")
jest_matrix
## 5000 x 100 rating matrix of class 'realRatingMatrix' with 362106 ratings.
head(jest_matrix)
## 1 x 100 rating matrix of class 'realRatingMatrix' with 81 ratings.
#Normalize by subtracting the row mean from all ratings in the row
jest_n <- normalize(jest_matrix)
#binarize
jest_b <- binarize(jest_matrix, minRating = 0.001)
#Create a random sample of 1000 users
set.seed(1234)
j.samp <- sample(Jester5k, 1000)
j.samp
## 1000 x 100 rating matrix of class 'realRatingMatrix' with 71682 ratings.
#show all of the ratings for one user
rowCounts(j.samp[1,])
## u5092
## 70
as(j.samp[1,], "list")
## $u5092
## j1 j2 j3 j4 j5 j6 j7 j8 j9 j10 j11 j12
## -3.40 -3.40 -1.99 1.31 -4.03 -0.15 -4.22 -4.56 -4.37 -0.34 0.53 -5.87
## j13 j14 j15 j16 j17 j18 j19 j20 j21 j22 j23 j24
## -2.57 -5.39 0.44 -3.98 -3.35 -4.03 -4.03 -0.29 -3.79 -3.59 0.05 -6.94
## j25 j26 j27 j28 j29 j30 j31 j32 j33 j34 j35 j36
## -0.53 -2.57 1.80 -5.87 2.14 -4.42 -1.70 0.68 -7.82 -0.49 -0.92 2.38
## j37 j38 j39 j40 j41 j42 j43 j44 j45 j46 j47 j48
## -7.96 -5.15 -5.87 -7.23 -4.37 -2.33 -5.15 0.34 3.40 -1.60 -2.04 -3.11
## j49 j50 j51 j52 j53 j54 j55 j56 j57 j58 j59 j60
## 2.82 2.96 -3.35 1.07 0.10 -6.99 0.78 -2.57 -6.75 -6.60 -4.51 1.46
## j61 j62 j63 j64 j65 j66 j67 j68 j69 j70
## -1.55 1.12 -7.48 -7.96 2.52 0.49 -3.25 -1.21 2.18 -5.87
#show rating averages for the first 6 users in the randomized sample
head(rowMeans(j.samp))
## u5092 u21106 u10043 u9518 u23200 u21929
## -2.413429 -1.403151 3.811216 1.186300 0.078500 2.672055
#shows how many jokes each user rated in the full data set
head(rowCounts(Jester5k))
## u2841 u15547 u15221 u15573 u21505 u15994
## 81 71 100 100 72 100
#shows how many times each joke was rated in the full data set
colCounts(Jester5k)
## j1 j2 j3 j4 j5 j6 j7 j8 j9 j10 j11 j12 j13 j14 j15
## 3314 3648 3338 3142 4998 4073 4999 5000 3173 4057 4353 4478 5000 4494 5000
## j16 j17 j18 j19 j20 j21 j22 j23 j24 j25 j26 j27 j28 j29 j30
## 4999 5000 5000 5000 5000 4983 4282 4025 3194 4172 4764 4981 4781 4992 3616
## j31 j32 j33 j34 j35 j36 j37 j38 j39 j40 j41 j42 j43 j44 j45
## 4937 4993 3366 4334 4994 4998 3379 4575 4611 4442 3764 4910 3515 3248 4276
## j46 j47 j48 j49 j50 j51 j52 j53 j54 j55 j56 j57 j58 j59 j60
## 4722 4463 4964 4995 4999 3803 4017 4997 4933 3972 4954 3223 3135 3651 3597
## j61 j62 j63 j64 j65 j66 j67 j68 j69 j70 j71 j72 j73 j74 j75
## 4964 4992 4047 3474 4951 4989 3532 4989 4987 4066 1706 1740 1699 1726 1751
## j76 j77 j78 j79 j80 j81 j82 j83 j84 j85 j86 j87 j88 j89 j90
## 1745 1758 1750 1799 1760 1819 1784 1835 1854 1864 1860 1872 1917 1901 1946
## j91 j92 j93 j94 j95 j96 j97 j98 j99 j100
## 1931 1958 2002 2026 2047 2076 2088 2131 2179 1968
#shows the average rating per user in the full data set
head(rowMeans(Jester5k))
## u2841 u15547 u15221 u15573 u21505 u15994
## 3.855185 -2.781972 1.394100 -1.427300 1.156806 -3.380500
#shows the average rating per joke in the full data set
head(colMeans(Jester5k))
## j1 j2 j3 j4 j5 j6
## 0.9186301 0.1912473 0.2437148 -1.4517250 0.3225630 1.6852909
Building and testing the POPULAR recommendation algorithm
############################# POPULAR ##################################
#Build a popularity recommender using the first 1,000 users
r <- Recommender(Jester5k[1:1000], method = "POPULAR")
r
## Recommender of type 'POPULAR' for 'realRatingMatrix'
## learned using 1000 users.
getModel(r)$topN
## Recommendations as 'topNList' with n = 100 for 1 users.
#Recommend 3 jokes to 2 users
recom <- predict(r, Jester5k[1005:1006], n = 3)
as(recom, "list")
## $u18250
## [1] "j89" "j72" "j93"
##
## $u9344
## [1] "j89" "j6" "j72"
#gives us predicted ratings for these users
recom <- predict(r, Jester5k[1006:1007], type = "ratings")
as(recom, "matrix")[,1:10]
## j1 j2 j3 j4 j5 j6 j7 j8 j9
## u9344 3.975461 3.356674 3.112906 1.304081 NA 4.588913 NA NA 2.245925
## u10946 NA NA NA NA NA NA NA NA NA
## j10
## u9344 4.156132
## u10946 NA
#if we want to include the original ratings for the users as well:
recom <- predict(r, Jester5k[1006:1007], type = "ratingMatrix")
as(recom, "matrix")[,1:10]
## j1 j2 j3 j4 j5 j6
## u9344 3.975461 3.356674 3.112906 1.304081 -2.708367 4.588913
## u10946 1.347600 -3.602400 -5.442400 -3.402400 1.307600 6.687600
## j7 j8 j9 j10
## u9344 0.06163265 -5.088367 2.245925 4.156132
## u10946 11.05760000 -3.502400 -7.972400 -3.452400
j.cross <- evaluationScheme(Jester5k[1:1000], method = "cross", k = 4, given = 3, goodRating = 5)
j.cross
## Evaluation scheme with 3 items given
## Method: 'cross-validation' with 4 run(s).
## Good ratings: >=5.000000
## Data set: 1000 x 100 rating matrix of class 'realRatingMatrix' with 72358 ratings.
j.cross.results <- evaluate(j.cross, method = "POPULAR", type = "topNList", n = c(1, 3, 5, 10, 15, 20))
## POPULAR run fold/sample [model time/prediction time]
## 1 [0sec/0.3sec]
## 2 [0.02sec/0.28sec]
## 3 [0sec/0.28sec]
## 4 [0.01sec/0.27sec]
plot(j.cross.results, annotate = TRUE, main = "ROC Curve for POPULAR Recommender Method")

plot(j.cross.results, "prec/rec", annotate = TRUE, main = "Precision-Recall for POPULAR Recommender Method")

Building and testing the UBCF and IBCF recommendation algorithms
######################### UBCF and IBCF ############################
#divide the data into 90% training 10% test
div <- evaluationScheme(Jester5k[1:1000], method="split", train = 0.9, given = 15, goodRating = 5)
div
## Evaluation scheme with 15 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.900
## Good ratings: >=5.000000
## Data set: 1000 x 100 rating matrix of class 'realRatingMatrix' with 72358 ratings.
#Create the user-based collaborative filtering recommender using the training data
r.ubcf <- Recommender(getData(div, "train"), "UBCF")
r.ubcf
## Recommender of type 'UBCF' for 'realRatingMatrix'
## learned using 900 users.
#Create the item-based collaborative filtering recommender using the training data
r.ibcf <- Recommender(getData(div, "train"), "IBCF")
r.ibcf
## Recommender of type 'IBCF' for 'realRatingMatrix'
## learned using 900 users.
#Compute predicted ratings for test data that is known using the UBCF algorithm
p.ubcf <- predict(r.ubcf, getData(div, "known"), type = "ratings")
p.ubcf
## 100 x 100 rating matrix of class 'realRatingMatrix' with 8500 ratings.
#Compute predicted ratings for test data that is known using the IBCF algorithm
p.ibcf <- predict(r.ibcf, getData(div, "known"), type = "ratings")
p.ibcf
## 100 x 100 rating matrix of class 'realRatingMatrix' with 8437 ratings.
#Calculate the error between training prediction and unknown test data
error <- rbind(
UBCF = calcPredictionAccuracy(p.ubcf, getData(div, "unknown")),
IBCF = calcPredictionAccuracy(p.ibcf, getData(div, "unknown")))
error
## RMSE MSE MAE
## UBCF 4.600006 21.16006 3.644863
## IBCF 5.170880 26.73800 4.067278
#another less clean way to code the same thing
error.ubcf<-calcPredictionAccuracy(p.ubcf, getData(div, "unknown"))
error.ibcf<-calcPredictionAccuracy(p.ibcf, getData(div, "unknown"))
error <- rbind(error.ubcf,error.ibcf)
rownames(error) <- c("UBCF","IBCF")
error
## RMSE MSE MAE
## UBCF 4.600006 21.16006 3.644863
## IBCF 5.170880 26.73800 4.067278
As you can see, user-based collaborative filtering does a better job of predicting ratings than item-based collaborative filtering. This can be seen in the lower prediction error value.