Collaborative filtering (CF) uses given rating data by many users for many items as the basis for predicting missing ratings and/or for creating a top-N recommendation list for a given user, called the active user.
In this project, I’m going to create two recommender algorithms, a user-based collaborative filtering recommender algorithm, and an item-based collaborative filtering recommender algorithm. We’ll use these to predict a user’s rating for a given movie.
The data I’ll be using is the Movielense 100k data set. This data set is compromised of 100,000 records containing user ratings of movies on a scale of 1-5 collected during a seven month period from 9/19/1997-4/22/1998. The data contains 943 users.
Data Import, Exploration, Transformation
#load library
library("recommenderlab")
## Warning: package 'recommenderlab' was built under R version 3.3.3
## Loading required package: Matrix
## Loading required package: arules
## Warning: package 'arules' was built under R version 3.3.3
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
## Loading required package: proxy
## Warning: package 'proxy' was built under R version 3.3.3
##
## Attaching package: 'proxy'
## The following object is masked from 'package:Matrix':
##
## as.matrix
## The following objects are masked from 'package:stats':
##
## as.dist, dist
## The following object is masked from 'package:base':
##
## as.matrix
## Loading required package: registry
data(MovieLense)
head(MovieLense)
## 1 x 1664 rating matrix of class 'realRatingMatrix' with 271 ratings.
ml <- as(MovieLense,"realRatingMatrix")
ml_n <- normalize(ml)
image(ml, main = "Raw Movielense Data")

image(ml_n, main = "Normalized Movielens Data")

image(MovieLense[1:100,1:100])

#Create the user-item matrix
ml_matrix <- as(MovieLense,"realRatingMatrix")
ml_matrix
## 943 x 1664 rating matrix of class 'realRatingMatrix' with 99392 ratings.
head(ml_matrix)
## 1 x 1664 rating matrix of class 'realRatingMatrix' with 271 ratings.
#Normalize by subtracting the row mean from all ratings in the row
ml_n <- normalize(ml_matrix)
#binarize
ml_b <- binarize(ml_matrix, minRating = 0.001)
#Create a random sample of 500 users
set.seed(1234)
ml.samp <- sample(MovieLense, 500)
ml.samp
## 500 x 1664 rating matrix of class 'realRatingMatrix' with 53136 ratings.
#show all of the ratings for one user
rowCounts(ml.samp[1,])
## 108
## 33
as(ml.samp[1,], "list")
## $`108`
## Toy Story (1995)
## 4
## Twelve Monkeys (1995)
## 5
## Richard III (1995)
## 5
## Mighty Aphrodite (1995)
## 3
## Postino, Il (1994)
## 5
## Muppet Treasure Island (1996)
## 3
## Star Wars (1977)
## 4
## Fargo (1996)
## 4
## Independence Day (ID4) (1996)
## 3
## Lone Star (1996)
## 4
## Phenomenon (1996)
## 3
## Godfather, The (1972)
## 4
## Big Night (1996)
## 5
## Return of the Jedi (1983)
## 3
## Star Trek: First Contact (1996)
## 2
## Jerry Maguire (1996)
## 3
## Lost World: Jurassic Park, The (1997)
## 3
## My Best Friend's Wedding (1997)
## 2
## Sense and Sensibility (1995)
## 5
## River Wild, The (1994)
## 4
## Time to Kill, A (1996)
## 3
## Tin Cup (1996)
## 3
## Fierce Creatures (1997)
## 4
## Liar Liar (1997)
## 4
## Fly Away Home (1996)
## 3
## Everyone Says I Love You (1996)
## 5
## Mission: Impossible (1996)
## 3
## Courage Under Fire (1996)
## 2
## Boot, Das (1981)
## 5
## In the Bleak Midwinter (1995)
## 4
## Jane Eyre (1996)
## 3
## Saint, The (1997)
## 3
## Island of Dr. Moreau, The (1996)
## 2
#show rating averages for the first 6 users in the randomized sample
#this data set is enormous so we'll just show the head
head(rowMeans(ml.samp))
## 108 587 574 586 809 601
## 3.575758 2.967742 3.500000 3.325301 3.450000 3.081081
#shows how many movies each user rated in the full data set
#this data set is enormous so we'll just show the head
head(rowCounts(MovieLense))
## 1 2 3 4 5 6
## 271 61 51 23 175 208
hist(rowCounts(MovieLense), main = "Number of Ratings per User")

#shows how many times each movie was rated in the full data set
#this data set is enormous so we'll just show the head
head(colCounts(MovieLense))
## Toy Story (1995)
## 452
## GoldenEye (1995)
## 131
## Four Rooms (1995)
## 90
## Get Shorty (1995)
## 209
## Copycat (1995)
## 86
## Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)
## 26
hist(colCounts(MovieLense), main = "Number of Ratings per Movie")

#shows the average rating per user in the full data set
#this data set is enormous so we'll just show the head
head(rowMeans(MovieLense))
## 1 2 3 4 5 6
## 3.605166 3.704918 2.764706 4.304348 2.874286 3.639423
#shows the average rating per movie in the full data set
#this data set is enormous so we'll just show the head
head(colMeans(MovieLense))
## Toy Story (1995)
## 3.878319
## GoldenEye (1995)
## 3.206107
## Four Rooms (1995)
## 3.033333
## Get Shorty (1995)
## 3.550239
## Copycat (1995)
## 3.302326
## Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)
## 3.576923
Building and testing the POPULAR recommendation algorithm
############################# POPULAR ##################################
#Build a popularity recommender using the first 1,000 users
r <- Recommender(MovieLense[1:800], method = "POPULAR")
r
## Recommender of type 'POPULAR' for 'realRatingMatrix'
## learned using 800 users.
getModel(r)$topN
## Recommendations as 'topNList' with n = 1664 for 1 users.
#Recommend 3 movies to 2 users
recom <- predict(r, MovieLense[809:810], n = 3)
as(recom, "list")
## $`809`
## [1] "Star Wars (1977)" "Fargo (1996)" "Godfather, The (1972)"
##
## $`810`
## [1] "Star Wars (1977)" "Fargo (1996)" "Godfather, The (1972)"
#gives us predicted ratings for these users
recom <- predict(r, MovieLense[809:810], type = "ratings")
as(recom, "matrix")[,1:10]
## Toy Story (1995) GoldenEye (1995) Four Rooms (1995) Get Shorty (1995)
## 809 3.741677 3.175188 3.100275 3.427367
## 810 4.731677 4.165188 4.090275 4.417367
## Copycat (1995) Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)
## 809 3.221336 3.6587
## 810 4.211336 4.6487
## Twelve Monkeys (1995) Babe (1995) Dead Man Walking (1995)
## 809 3.658617 3.834488 3.762438
## 810 4.648617 4.824488 4.752438
## Richard III (1995)
## 809 3.73027
## 810 4.72027
#if we want to include the original ratings for the users as well:
recom <- predict(r, MovieLense[809:810], type = "ratingMatrix")
as(recom, "matrix")[,1:10]
## Toy Story (1995) GoldenEye (1995) Four Rooms (1995) Get Shorty (1995)
## 809 3.741677 3.175188 3.100275 3.427367
## 810 4.731677 4.165188 4.090275 4.417367
## Copycat (1995) Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)
## 809 3.221336 3.6587
## 810 4.211336 4.6487
## Twelve Monkeys (1995) Babe (1995) Dead Man Walking (1995)
## 809 3.658617 3.834488 3.762438
## 810 4.648617 4.824488 4.752438
## Richard III (1995)
## 809 3.73027
## 810 4.72027
ml.cross <- evaluationScheme(MovieLense[1:800], method = "cross", k = 4, given = 3, goodRating = 5)
ml.cross
## Evaluation scheme with 3 items given
## Method: 'cross-validation' with 4 run(s).
## Good ratings: >=5.000000
## Data set: 800 x 1664 rating matrix of class 'realRatingMatrix' with 85001 ratings.
ml.cross.results <- evaluate(ml.cross, method = "POPULAR", type = "topNList", n = c(1, 3, 5, 10, 15, 20))
## POPULAR run fold/sample [model time/prediction time]
## 1 [0sec/0.47sec]
## 2 [0sec/0.55sec]
## 3 [0.02sec/0.54sec]
## 4 [0sec/0.48sec]
plot(ml.cross.results, annotate = TRUE, main = "ROC Curve for POPULAR Recommender Method")

plot(ml.cross.results, "prec/rec", annotate = TRUE, main = "Precision-Recall for POPULAR Recommender Method")

Building and testing the UBCF and IBCF recommendation algorithms
######################### UBCF and IBCF ############################
#divide the data into 90% training 10% test
div <- evaluationScheme(MovieLense[1:943], method="split", train = 0.9, given = 15, goodRating = 5)
div
## Evaluation scheme with 15 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.900
## Good ratings: >=5.000000
## Data set: 943 x 1664 rating matrix of class 'realRatingMatrix' with 99392 ratings.
#Create the user-based collaborative filtering recommender using the training data
r.ubcf <- Recommender(getData(div, "train"), "UBCF")
r.ubcf
## Recommender of type 'UBCF' for 'realRatingMatrix'
## learned using 848 users.
#Create the item-based collaborative filtering recommender using the training data
r.ibcf <- Recommender(getData(div, "train"), "IBCF")
r.ibcf
## Recommender of type 'IBCF' for 'realRatingMatrix'
## learned using 848 users.
#Compute predicted ratings for test data that is known using the UBCF algorithm
p.ubcf <- predict(r.ubcf, getData(div, "known"), type = "ratings")
p.ubcf
## 95 x 1664 rating matrix of class 'realRatingMatrix' with 156655 ratings.
#Compute predicted ratings for test data that is known using the IBCF algorithm
p.ibcf <- predict(r.ibcf, getData(div, "known"), type = "ratings")
p.ibcf
## 95 x 1664 rating matrix of class 'realRatingMatrix' with 19363 ratings.
#Calculate the error between training prediction and unknown test data
error <- rbind(
UBCF = calcPredictionAccuracy(p.ubcf, getData(div, "unknown")),
IBCF = calcPredictionAccuracy(p.ibcf, getData(div, "unknown")))
error
## RMSE MSE MAE
## UBCF 1.085484 1.178276 0.8644544
## IBCF 1.265857 1.602394 0.9299493
#another less clean way to code the same thing
error.ubcf<-calcPredictionAccuracy(p.ubcf, getData(div, "unknown"))
error.ibcf<-calcPredictionAccuracy(p.ibcf, getData(div, "unknown"))
error <- rbind(error.ubcf,error.ibcf)
rownames(error) <- c("UBCF","IBCF")
error
## RMSE MSE MAE
## UBCF 1.085484 1.178276 0.8644544
## IBCF 1.265857 1.602394 0.9299493
As you can see, user-based collaborative filtering does a better job of predicting ratings than item-based collaborative filtering. This can be seen in the lower prediction error value of 1.085484 versus the 1.178276 that was produced by the item-based collaborative filtering recommender.