Collaborative filtering (CF) uses the ratings that many users have given to many items as the basis for predicting missing ratings and/or for creating a top-N recommendation list for a given user, called the active user.
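To make the idea concrete, here is a minimal sketch (with a made-up matrix `toy` and invented ratings, purely for illustration) of the user-based flavor of CF: find users who rate similarly to the active user, then predict a missing rating as a similarity-weighted average of those users' ratings for the same item.

#a tiny made-up user-item matrix; NA marks an unrated item
toy <- rbind(alice = c(5, 3, NA),
             bob   = c(4, 2, 4),
             carol = c(1, 5, 2))
colnames(toy) <- c("jokeA", "jokeB", "jokeC")

#cosine similarity between two users, computed over the jokes they both rated
cos_sim <- function(x, y) {
  ok <- !is.na(x) & !is.na(y)
  sum(x[ok] * y[ok]) / (sqrt(sum(x[ok]^2)) * sqrt(sum(y[ok]^2)))
}
sims <- c(bob   = cos_sim(toy["alice", ], toy["bob", ]),
          carol = cos_sim(toy["alice", ], toy["carol", ]))

#predict alice's missing rating for jokeC as a similarity-weighted average
sum(sims * toy[c("bob", "carol"), "jokeC"]) / sum(abs(sims))

Item-based CF works the same way, except the similarities are computed between item columns rather than user rows.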

In this project, I'm going to build two recommenders: a user-based collaborative filtering (UBCF) algorithm and an item-based collaborative filtering (IBCF) algorithm. We'll use both to predict a user's rating for a given joke.

The data I'll be using is the Jester Dataset created by Ken Goldberg at UC Berkeley. He set up a site that prompts you with a joke, which you rate on a sliding, continuous scale from -10 to 10 according to how funny you found it. After you rate just a few jokes, the site presents a list of jokes its model predicts you'll find funny. The data we'll be working with contains one rating per joke per user; some values are blank, since not every user rated every joke.

Data Import, Exploration, Transformation

#load library
library("recommenderlab")
## Warning: package 'recommenderlab' was built under R version 3.3.3
## Loading required package: Matrix
## Loading required package: arules
## Warning: package 'arules' was built under R version 3.3.3
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
## Loading required package: proxy
## Warning: package 'proxy' was built under R version 3.3.3
## 
## Attaching package: 'proxy'
## The following object is masked from 'package:Matrix':
## 
##     as.matrix
## The following objects are masked from 'package:stats':
## 
##     as.dist, dist
## The following object is masked from 'package:base':
## 
##     as.matrix
## Loading required package: registry
#load a small subset for manual calculations
jest_data <- read.csv("https://raw.githubusercontent.com/excelsiordata/DATA643/master/jester-subset.csv")
#take a look
head(jest_data)
##   userid j1 j2 j3 j4 j5 j6 j7 j8 j9 j10
## 1     u1  1  5  1  1  5  5  2  2  2   5
## 2     u2 NA  4  4 NA  4  1  5  1  5   4
## 3     u3  5  2 NA  1  3  3  4  4 NA   3
## 4     u4  2 NA  2  2 NA  2 NA  1  1   4
## 5     u5 NA  5 NA NA  4  4  1  2  1   3
## 6     u6  5  3  5  2 NA NA  1 NA  4  NA
#convert NAs to 0
jest_data[is.na(jest_data)] <- 0
#take a look again and make sure that worked
head(jest_data)
##   userid j1 j2 j3 j4 j5 j6 j7 j8 j9 j10
## 1     u1  1  5  1  1  5  5  2  2  2   5
## 2     u2  0  4  4  0  4  1  5  1  5   4
## 3     u3  5  2  0  1  3  3  4  4  0   3
## 4     u4  2  0  2  2  0  2  0  1  1   4
## 5     u5  0  5  0  0  4  4  1  2  1   3
## 6     u6  5  3  5  2  0  0  1  0  4   0
#move the user ids into the row names, drop that column, and coerce the numeric matrix to a realRatingMatrix
rownames(jest_data) <- jest_data$userid
j_d_m <- as(as.matrix(jest_data[, -1]), "realRatingMatrix")
j_d_m_n <- normalize(j_d_m)

image(j_d_m, main = "Raw Subset")

image(j_d_m_n, main = "Normalized Subset")
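A quick sanity check on what normalize() does here (a sketch; by default it centers each user's row on that user's mean): the row means of the normalized subset should all be approximately zero.

#row means before centering, and after (the latter should be ~0 for every user)
rowMeans(j_d_m)
round(rowMeans(j_d_m_n), 10)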

##########################Begin work on the full data set##########################
#load the built-in Jester5k data set that ships with recommenderlab
data(Jester5k)

#Jester5k already comes as a user-item realRatingMatrix; make a working copy
jest_matrix <- as(Jester5k,"realRatingMatrix")
jest_matrix
## 5000 x 100 rating matrix of class 'realRatingMatrix' with 362106 ratings.
head(jest_matrix)
## 1 x 100 rating matrix of class 'realRatingMatrix' with 81 ratings.
#Normalize by subtracting the row mean from all ratings in the row
jest_n <- normalize(jest_matrix)

#binarize
jest_b <- binarize(jest_matrix, minRating = 0.001)
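#quick check (a sketch of binarize's behavior): ratings at or above minRating become TRUE
#("liked") and everything else is dropped, so rowCounts() on the binarized matrix counts
#the positively rated jokes per user
head(rowCounts(jest_b))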

#Create a random sample of 1000 users
set.seed(1234)
j.samp <- sample(Jester5k, 1000)
j.samp
## 1000 x 100 rating matrix of class 'realRatingMatrix' with 71682 ratings.
#show all of the ratings for one user
rowCounts(j.samp[1,])
## u5092 
##    70
as(j.samp[1,], "list")
## $u5092
##    j1    j2    j3    j4    j5    j6    j7    j8    j9   j10   j11   j12 
## -3.40 -3.40 -1.99  1.31 -4.03 -0.15 -4.22 -4.56 -4.37 -0.34  0.53 -5.87 
##   j13   j14   j15   j16   j17   j18   j19   j20   j21   j22   j23   j24 
## -2.57 -5.39  0.44 -3.98 -3.35 -4.03 -4.03 -0.29 -3.79 -3.59  0.05 -6.94 
##   j25   j26   j27   j28   j29   j30   j31   j32   j33   j34   j35   j36 
## -0.53 -2.57  1.80 -5.87  2.14 -4.42 -1.70  0.68 -7.82 -0.49 -0.92  2.38 
##   j37   j38   j39   j40   j41   j42   j43   j44   j45   j46   j47   j48 
## -7.96 -5.15 -5.87 -7.23 -4.37 -2.33 -5.15  0.34  3.40 -1.60 -2.04 -3.11 
##   j49   j50   j51   j52   j53   j54   j55   j56   j57   j58   j59   j60 
##  2.82  2.96 -3.35  1.07  0.10 -6.99  0.78 -2.57 -6.75 -6.60 -4.51  1.46 
##   j61   j62   j63   j64   j65   j66   j67   j68   j69   j70 
## -1.55  1.12 -7.48 -7.96  2.52  0.49 -3.25 -1.21  2.18 -5.87
#show rating averages for the first 6 users in the randomized sample
head(rowMeans(j.samp))
##     u5092    u21106    u10043     u9518    u23200    u21929 
## -2.413429 -1.403151  3.811216  1.186300  0.078500  2.672055
#shows how many jokes each user rated in the full data set
head(rowCounts(Jester5k))
##  u2841 u15547 u15221 u15573 u21505 u15994 
##     81     71    100    100     72    100
#shows how many times each joke was rated in the full data set
colCounts(Jester5k)
##   j1   j2   j3   j4   j5   j6   j7   j8   j9  j10  j11  j12  j13  j14  j15 
## 3314 3648 3338 3142 4998 4073 4999 5000 3173 4057 4353 4478 5000 4494 5000 
##  j16  j17  j18  j19  j20  j21  j22  j23  j24  j25  j26  j27  j28  j29  j30 
## 4999 5000 5000 5000 5000 4983 4282 4025 3194 4172 4764 4981 4781 4992 3616 
##  j31  j32  j33  j34  j35  j36  j37  j38  j39  j40  j41  j42  j43  j44  j45 
## 4937 4993 3366 4334 4994 4998 3379 4575 4611 4442 3764 4910 3515 3248 4276 
##  j46  j47  j48  j49  j50  j51  j52  j53  j54  j55  j56  j57  j58  j59  j60 
## 4722 4463 4964 4995 4999 3803 4017 4997 4933 3972 4954 3223 3135 3651 3597 
##  j61  j62  j63  j64  j65  j66  j67  j68  j69  j70  j71  j72  j73  j74  j75 
## 4964 4992 4047 3474 4951 4989 3532 4989 4987 4066 1706 1740 1699 1726 1751 
##  j76  j77  j78  j79  j80  j81  j82  j83  j84  j85  j86  j87  j88  j89  j90 
## 1745 1758 1750 1799 1760 1819 1784 1835 1854 1864 1860 1872 1917 1901 1946 
##  j91  j92  j93  j94  j95  j96  j97  j98  j99 j100 
## 1931 1958 2002 2026 2047 2076 2088 2131 2179 1968
#shows the average rating per user in the full data set
head(rowMeans(Jester5k))
##     u2841    u15547    u15221    u15573    u21505    u15994 
##  3.855185 -2.781972  1.394100 -1.427300  1.156806 -3.380500
#shows the average rating per joke in the full data set
head(colMeans(Jester5k))
##         j1         j2         j3         j4         j5         j6 
##  0.9186301  0.1912473  0.2437148 -1.4517250  0.3225630  1.6852909

Building and testing the UBCF and IBCF recommendation algorithms

######################### UBCF and IBCF ############################

#divide the data into 90% training 10% test
div <- evaluationScheme(Jester5k[1:1000], method="split", train = 0.9, given = 15, goodRating = 5)
div
## Evaluation scheme with 15 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.900
## Good ratings: >=5.000000
## Data set: 1000 x 100 rating matrix of class 'realRatingMatrix' with 72358 ratings.
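#note (hedged reading of the scheme): given = 15 means each test user keeps 15 ratings as
#"known" input for prediction, the rest are withheld as "unknown" for scoring, and
#goodRating = 5 only matters for top-N/binary evaluation. Check that every test user
#really has exactly 15 known ratings:
table(rowCounts(getData(div, "known")))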
#Create the user-based collaborative filtering recommender using the training data
r.ubcf <- Recommender(getData(div, "train"), "UBCF")
r.ubcf
## Recommender of type 'UBCF' for 'realRatingMatrix' 
## learned using 900 users.
#Create the item-based collaborative filtering recommender using the training data
r.ibcf <- Recommender(getData(div, "train"), "IBCF")
r.ibcf
## Recommender of type 'IBCF' for 'realRatingMatrix' 
## learned using 900 users.
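#peek at what each model learned (a sketch; these element names are recommenderlab model
#internals and may differ by version): UBCF keeps the training data plus a neighborhood
#size nn, while IBCF pre-computes an item-item similarity matrix
getModel(r.ubcf)$nn
dim(getModel(r.ibcf)$sim)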
#Use the UBCF model to predict ratings for the test users from their 15 known ratings
p.ubcf <- predict(r.ubcf, getData(div, "known"), type = "ratings")
p.ubcf
## 100 x 100 rating matrix of class 'realRatingMatrix' with 8500 ratings.
#Use the IBCF model to predict ratings for the test users from their 15 known ratings
p.ibcf <- predict(r.ibcf, getData(div, "known"), type = "ratings")
p.ibcf
## 100 x 100 rating matrix of class 'realRatingMatrix' with 8437 ratings.
#Calculate the error between the predicted ratings and the withheld (unknown) test ratings
error <- rbind(
  UBCF = calcPredictionAccuracy(p.ubcf, getData(div, "unknown")),
  IBCF = calcPredictionAccuracy(p.ibcf, getData(div, "unknown")))
error
##          RMSE      MSE      MAE
## UBCF 4.600006 21.16006 3.644863
## IBCF 5.170880 26.73800 4.067278
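#for reference, RMSE is the square root of the mean squared difference between predicted and
#withheld ratings; a rough manual check for UBCF (a sketch, aligning both as dense matrices)
pred   <- as(p.ubcf, "matrix")
actual <- as(getData(div, "unknown"), "matrix")
sqrt(mean((pred - actual)^2, na.rm = TRUE))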
#another less clean way to code the same thing
error.ubcf<-calcPredictionAccuracy(p.ubcf, getData(div, "unknown"))
error.ibcf<-calcPredictionAccuracy(p.ibcf, getData(div, "unknown"))
error <- rbind(error.ubcf,error.ibcf)
rownames(error) <- c("UBCF","IBCF")
error
##          RMSE      MSE      MAE
## UBCF 4.600006 21.16006 3.644863
## IBCF 5.170880 26.73800 4.067278
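The same fitted models can also produce the top-N lists mentioned at the start. A minimal sketch using the UBCF model (the returned joke IDs are whatever it ranks highest for that user; not verified output):

#top 5 recommended jokes for the first test user
top5 <- predict(r.ubcf, getData(div, "known"), type = "topNList", n = 5)
as(top5, "list")[1]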

As the error table shows, user-based collaborative filtering predicts ratings more accurately than item-based collaborative filtering on this data: UBCF's RMSE (4.60 vs. 5.17), MSE, and MAE are all lower.