Collaborative filtering (CF) uses given rating data by many users for many items as the basis for predicting missing ratings and/or for creating a top-N recommendation list for a given user, called the active user.

In this project, I’m going to create two recommender algorithms, a user-based collaborative filtering recommender algorithm, and an item-based collaborative filtering recommender algorithm. We’ll use these to predict a user’s rating for a given movie.

The data I’ll be using is the Movielense 100k data set. This data set is compromised of 100,000 records containing user ratings of movies on a scale of 1-5 collected during a seven month period from 9/19/1997-4/22/1998. The data contains 943 users.

Data Import, Exploration, Transformation

#load library
library("recommenderlab")
## Warning: package 'recommenderlab' was built under R version 3.3.3
## Loading required package: Matrix
## Loading required package: arules
## Warning: package 'arules' was built under R version 3.3.3
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
## Loading required package: proxy
## Warning: package 'proxy' was built under R version 3.3.3
## 
## Attaching package: 'proxy'
## The following object is masked from 'package:Matrix':
## 
##     as.matrix
## The following objects are masked from 'package:stats':
## 
##     as.dist, dist
## The following object is masked from 'package:base':
## 
##     as.matrix
## Loading required package: registry
data(MovieLense)
head(MovieLense)
## 1 x 1664 rating matrix of class 'realRatingMatrix' with 271 ratings.
ml <- as(MovieLense,"realRatingMatrix")
ml_n <- normalize(ml)

image(ml, main = "Raw Movielense Data")

image(ml_n, main = "Normalized Movielens Data")

image(MovieLense[1:100,1:100])

#Create the user-item matrix
ml_matrix <- as(MovieLense,"realRatingMatrix")
ml_matrix
## 943 x 1664 rating matrix of class 'realRatingMatrix' with 99392 ratings.
head(ml_matrix)
## 1 x 1664 rating matrix of class 'realRatingMatrix' with 271 ratings.
#Normalize by subtracting the row mean from all ratings in the row
ml_n <- normalize(ml_matrix)

#binarize
ml_b <- binarize(ml_matrix, minRating = 0.001)

#Create a random sample of 500 users
set.seed(1234)
ml.samp <- sample(MovieLense, 500)
ml.samp
## 500 x 1664 rating matrix of class 'realRatingMatrix' with 53136 ratings.
#show all of the ratings for one user
rowCounts(ml.samp[1,])
## 108 
##  33
as(ml.samp[1,], "list")
## $`108`
##                      Toy Story (1995) 
##                                     4 
##                 Twelve Monkeys (1995) 
##                                     5 
##                    Richard III (1995) 
##                                     5 
##               Mighty Aphrodite (1995) 
##                                     3 
##                    Postino, Il (1994) 
##                                     5 
##         Muppet Treasure Island (1996) 
##                                     3 
##                      Star Wars (1977) 
##                                     4 
##                          Fargo (1996) 
##                                     4 
##         Independence Day (ID4) (1996) 
##                                     3 
##                      Lone Star (1996) 
##                                     4 
##                     Phenomenon (1996) 
##                                     3 
##                 Godfather, The (1972) 
##                                     4 
##                      Big Night (1996) 
##                                     5 
##             Return of the Jedi (1983) 
##                                     3 
##       Star Trek: First Contact (1996) 
##                                     2 
##                  Jerry Maguire (1996) 
##                                     3 
## Lost World: Jurassic Park, The (1997) 
##                                     3 
##       My Best Friend's Wedding (1997) 
##                                     2 
##          Sense and Sensibility (1995) 
##                                     5 
##                River Wild, The (1994) 
##                                     4 
##                Time to Kill, A (1996) 
##                                     3 
##                        Tin Cup (1996) 
##                                     3 
##               Fierce Creatures (1997) 
##                                     4 
##                      Liar Liar (1997) 
##                                     4 
##                  Fly Away Home (1996) 
##                                     3 
##       Everyone Says I Love You (1996) 
##                                     5 
##            Mission: Impossible (1996) 
##                                     3 
##             Courage Under Fire (1996) 
##                                     2 
##                      Boot, Das (1981) 
##                                     5 
##         In the Bleak Midwinter (1995) 
##                                     4 
##                      Jane Eyre (1996) 
##                                     3 
##                     Saint, The (1997) 
##                                     3 
##      Island of Dr. Moreau, The (1996) 
##                                     2
#show rating averages for the first 6 users in the randomized sample
#this data set is enormous so we'll just show the head
head(rowMeans(ml.samp))
##      108      587      574      586      809      601 
## 3.575758 2.967742 3.500000 3.325301 3.450000 3.081081
#shows how many movies each user rated in the full data set
#this data set is enormous so we'll just show the head
head(rowCounts(MovieLense))
##   1   2   3   4   5   6 
## 271  61  51  23 175 208
hist(rowCounts(MovieLense), main = "Number of Ratings per User")

#shows how many times each movie was rated in the full data set
#this data set is enormous so we'll just show the head
head(colCounts(MovieLense))
##                                     Toy Story (1995) 
##                                                  452 
##                                     GoldenEye (1995) 
##                                                  131 
##                                    Four Rooms (1995) 
##                                                   90 
##                                    Get Shorty (1995) 
##                                                  209 
##                                       Copycat (1995) 
##                                                   86 
## Shanghai Triad (Yao a yao yao dao waipo qiao) (1995) 
##                                                   26
hist(colCounts(MovieLense), main = "Number of Ratings per Movie")

#shows the average rating per user in the full data set
#this data set is enormous so we'll just show the head
head(rowMeans(MovieLense))
##        1        2        3        4        5        6 
## 3.605166 3.704918 2.764706 4.304348 2.874286 3.639423
#shows the average rating per movie in the full data set
#this data set is enormous so we'll just show the head
head(colMeans(MovieLense))
##                                     Toy Story (1995) 
##                                             3.878319 
##                                     GoldenEye (1995) 
##                                             3.206107 
##                                    Four Rooms (1995) 
##                                             3.033333 
##                                    Get Shorty (1995) 
##                                             3.550239 
##                                       Copycat (1995) 
##                                             3.302326 
## Shanghai Triad (Yao a yao yao dao waipo qiao) (1995) 
##                                             3.576923

Building and testing the UBCF and IBCF recommendation algorithms

######################### UBCF and IBCF ############################

#divide the data into 90% training 10% test
div <- evaluationScheme(MovieLense[1:943], method="split", train = 0.9, given = 15, goodRating = 5)
div
## Evaluation scheme with 15 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.900
## Good ratings: >=5.000000
## Data set: 943 x 1664 rating matrix of class 'realRatingMatrix' with 99392 ratings.
#Create the user-based collaborative filtering recommender using the training data
r.ubcf <- Recommender(getData(div, "train"), "UBCF")
r.ubcf
## Recommender of type 'UBCF' for 'realRatingMatrix' 
## learned using 848 users.
#Create the item-based collaborative filtering recommender using the training data
r.ibcf <- Recommender(getData(div, "train"), "IBCF")
r.ibcf
## Recommender of type 'IBCF' for 'realRatingMatrix' 
## learned using 848 users.
#Compute predicted ratings for test data that is known using the UBCF algorithm
p.ubcf <- predict(r.ubcf, getData(div, "known"), type = "ratings")
p.ubcf
## 95 x 1664 rating matrix of class 'realRatingMatrix' with 156655 ratings.
#Compute predicted ratings for test data that is known using the IBCF algorithm
p.ibcf <- predict(r.ibcf, getData(div, "known"), type = "ratings")
p.ibcf
## 95 x 1664 rating matrix of class 'realRatingMatrix' with 19363 ratings.
#Calculate the error between training prediction and unknown test data
error <- rbind(
  UBCF = calcPredictionAccuracy(p.ubcf, getData(div, "unknown")),
  IBCF = calcPredictionAccuracy(p.ibcf, getData(div, "unknown")))
error
##          RMSE      MSE       MAE
## UBCF 1.085484 1.178276 0.8644544
## IBCF 1.265857 1.602394 0.9299493
#another less clean way to code the same thing
error.ubcf<-calcPredictionAccuracy(p.ubcf, getData(div, "unknown"))
error.ibcf<-calcPredictionAccuracy(p.ibcf, getData(div, "unknown"))
error <- rbind(error.ubcf,error.ibcf)
rownames(error) <- c("UBCF","IBCF")
error
##          RMSE      MSE       MAE
## UBCF 1.085484 1.178276 0.8644544
## IBCF 1.265857 1.602394 0.9299493

As you can see, user-based collaborative filtering does a better job of predicting ratings than item-based collaborative filtering. This can be seen in the lower prediction error value of 1.085484 versus the 1.178276 that was produced by the item-based collaborative filtering recommender.