The purpose of this project is to implement recommendation algorithms for an existing dataset of user-item ratings. The data comes from MovieLens and can be downloaded from https://grouplens.org/datasets/movielens/. The small version of the dataset contains ratings from 610 users for about 10000 movies; the MovieLense sample used below contains 943 users and 1664 movies, rated on a scale from 1 to 5. User-based and item-based collaborative filtering will be employed in this project.
The “recommenderlab” package will be used as the core package for building the recommender algorithms.
Reading the data and making the necessary transformations. We can use the MovieLense data set from recommenderlab (quicker for testing code) or the full version from https://grouplens.org/datasets/movielens/.
# loading the package and reading data (recommenderlab ships with a sample of the MovieLens data, called MovieLense)
library(recommenderlab)
data(MovieLense)
movie_matrix <- MovieLense
dim(movie_matrix)
## [1] 943 1664
# alternative way: loading the MovieLens dataset downloaded from https://grouplens.org/datasets/movielens/
# library(dplyr); library(tidyr)
# reading data
# ratings = read.csv("/Users/Olga/Desktop/ml-latest-small/ratings.csv")
# transforming to a wide format (one row per user, one column per movie)
# data<-ratings%>% select (movieId, userId, rating) %>% spread (movieId,rating)
# converting the data set into a real rating matrix (the first column, userId, is dropped)
# movie_matrix <- as(as.matrix(data[-c(1)]), "realRatingMatrix")
# looking at the matrix structure and the first 5 rows and columns of the real rating matrix
str(movie_matrix)
## Formal class 'realRatingMatrix' [package "recommenderlab"] with 2 slots
## ..@ data :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. ..@ i : int [1:99392] 0 1 4 5 9 12 14 15 16 17 ...
## .. .. ..@ p : int [1:1665] 0 452 583 673 882 968 994 1386 1605 1904 ...
## .. .. ..@ Dim : int [1:2] 943 1664
## .. .. ..@ Dimnames:List of 2
## .. .. .. ..$ : chr [1:943] "1" "2" "3" "4" ...
## .. .. .. ..$ : chr [1:1664] "Toy Story (1995)" "GoldenEye (1995)" "Four Rooms (1995)" "Get Shorty (1995)" ...
## .. .. ..@ x : num [1:99392] 5 4 4 4 4 3 1 5 4 5 ...
## .. .. ..@ factors : list()
## ..@ normalize: NULL
head(movie_matrix@data [1:5,1:5])
## 5 x 5 sparse Matrix of class "dgCMatrix"
## Toy Story (1995) GoldenEye (1995) Four Rooms (1995) Get Shorty (1995)
## 1 5 3 4 3
## 2 4 . . .
## 3 . . . .
## 4 . . . .
## 5 4 3 . .
## Copycat (1995)
## 1 3
## 2 .
## 3 .
## 4 .
## 5 .
# looking at the ratings provided by users 1 and 100 for the first 5 movies (0 means the movie was not rated)
movie_matrix@data[1,1:5]
## Toy Story (1995) GoldenEye (1995) Four Rooms (1995) Get Shorty (1995)
## 5 3 4 3
## Copycat (1995)
## 3
movie_matrix@data[100, 1:5]
## Toy Story (1995) GoldenEye (1995) Four Rooms (1995) Get Shorty (1995)
## 0 0 0 0
## Copycat (1995)
## 0
The matrix contains 943 users and 1664 movies with 99392 ratings in total, so only about 6.3% of the user-movie cells hold a rating (the matrix is sparse).
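As a quick check of how sparse the matrix is, the share of filled cells can be computed from the dimensions and the rating count reported in the output. This is a minimal base R sketch; the variable names are illustrative only.

```r
# Dimensions and rating count of the MovieLense sample, as reported in the output
n_users   <- 943
n_movies  <- 1664
n_ratings <- 99392

# Share of user-movie cells that actually hold a rating
density <- n_ratings / (n_users * n_movies)
round(100 * density, 1)  # percentage of filled cells
## [1] 6.3
```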
# looking at the number of movies the 100th user has rated
length(movie_matrix@data[100,][movie_matrix@data[100,] > 0])
## [1] 56
# checking total number of ratings given by the users
nratings(movie_matrix)
## [1] 99392
# overall rating distribution
hist(getRatings(movie_matrix), main = "Distribution Of Ratings", xlim=c(0,5), breaks="FD")
The most common rating is 4.
# finding the most/least popular movies
ratings_binary<-binarize(movie_matrix, minRating = 1)
ratings_binary
## 943 x 1664 rating matrix of class 'binaryRatingMatrix' with 99392 ratings.
ratings_sum<-colSums(ratings_binary)
ratings_sum_df<- data.frame(movie = names(ratings_sum), pratings = ratings_sum)
head(ratings_sum_df[order(-ratings_sum_df$pratings), ],10)
## movie pratings
## Star Wars (1977) Star Wars (1977) 583
## Contact (1997) Contact (1997) 509
## Fargo (1996) Fargo (1996) 508
## Return of the Jedi (1983) Return of the Jedi (1983) 507
## Liar Liar (1997) Liar Liar (1997) 485
## English Patient, The (1996) English Patient, The (1996) 481
## Scream (1996) Scream (1996) 478
## Toy Story (1995) Toy Story (1995) 452
## Air Force One (1997) Air Force One (1997) 431
## Independence Day (ID4) (1996) Independence Day (ID4) (1996) 429
tail(ratings_sum_df[order(-ratings_sum_df$pratings), ],10)
## movie
## Further Gesture, A (1996) Further Gesture, A (1996)
## Mirage (1995) Mirage (1995)
## Mamma Roma (1962) Mamma Roma (1962)
## Sunchaser, The (1996) Sunchaser, The (1996)
## War at Home, The (1996) War at Home, The (1996)
## Sweet Nothing (1995) Sweet Nothing (1995)
## Mat' i syn (1997) Mat' i syn (1997)
## B. Monkey (1998) B. Monkey (1998)
## You So Crazy (1994) You So Crazy (1994)
## Scream of Stone (Schrei aus Stein) (1991) Scream of Stone (Schrei aus Stein) (1991)
## pratings
## Further Gesture, A (1996) 1
## Mirage (1995) 1
## Mamma Roma (1962) 1
## Sunchaser, The (1996) 1
## War at Home, The (1996) 1
## Sweet Nothing (1995) 1
## Mat' i syn (1997) 1
## B. Monkey (1998) 1
## You So Crazy (1994) 1
## Scream of Stone (Schrei aus Stein) (1991) 1
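Two normalization techniques are going to be implemented: centering (subtracting each user's mean rating) and Z-score normalization (additionally dividing by the standard deviation of each user's ratings):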
data.norm.c<-normalize(movie_matrix, method="center")
data.norm.z<-normalize(movie_matrix, method="Z-score")
# plotting the rating distribution for raw, centered and Z-score normalized ratings
par(mfrow = c(3,1))
plot(density(getRatings(movie_matrix)),main = 'Raw')
plot(density(getRatings(data.norm.c)),main = 'Normalized')
plot(density(getRatings(data.norm.z)),main = 'Z-score')
par(mfrow = c(1,1))
From the plots we can see that normalization shifts the ratings from their original distribution to one that is centered around zero and closer to a normal distribution.
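To make the two normalizations concrete, here is a small base R sketch for a single hypothetical user. It assumes that normalization is applied per user and that the Z-score uses the sample standard deviation; the data and variable names are made up for illustration.

```r
# Hypothetical ratings of one user (NA = movie not rated)
user_ratings <- c(5, 3, NA, 4, NA, 2)

rated     <- !is.na(user_ratings)
user_mean <- mean(user_ratings[rated])  # average of the rated movies only
user_sd   <- sd(user_ratings[rated])    # sample standard deviation

# Centering removes the user's rating bias
centered <- user_ratings - user_mean

# Z-score additionally rescales by the user's rating spread
z_scored <- centered / user_sd

centered
z_scored
```

After centering, a user who rates everything high and a user who rates everything low become comparable, because only deviations from their own average remain.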
User-based collaborative filtering assumes that similar users have similar movie tastes.
It is a memory-based model, as it loads the whole rating matrix into memory.
It is a two-step process: the first step finds the neighbours of a given user (using a similarity measure such as the Pearson coefficient or cosine similarity). In the second step, for an item not rated by the user, we use the average rating that the user's neighbours gave to that item.
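The two steps can be sketched in base R on a toy rating matrix. This is only an illustration under simplifying assumptions (hypothetical data, a plain unweighted average, no normalization); it is not how recommenderlab implements UBCF internally.

```r
# Toy user-item rating matrix (hypothetical data; NA = not rated)
ratings <- matrix(
  c(5, 4, 1, NA,
    4, 5, 2, 5,
    1, 2, 5, 2,
    5, 5, 1, 4),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("u", 1:4), paste0("m", 1:4))
)

target <- "u1"
item   <- "m4"  # u1 has not rated m4

# Step 1: Pearson similarity between the target user and every other user,
# computed on the movies both users have rated
others <- setdiff(rownames(ratings), target)
sims <- sapply(others, function(u) {
  cor(ratings[target, ], ratings[u, ], use = "pairwise.complete.obs")
})

# Step 2: keep the k most similar users who rated the item
# and average their ratings for it
k <- 2
candidates <- others[!is.na(ratings[others, item])]
neighbours <- names(sort(sims[candidates], decreasing = TRUE))[1:k]
prediction <- mean(ratings[neighbours, item])

neighbours
prediction
```

Here u4 and u2 rate the first three movies much like u1 does, while u3 rates them in the opposite way, so the prediction for m4 is the average of the ratings given by u4 and u2.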
Now I am going to build user-based models employing two types of normalization techniques and two types of similarity measures: the Pearson coefficient and cosine similarity.
A cross-validation scheme will be used to evaluate the models’ performance.
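In such a scheme, the ratings of each test user are split into a part that is shown to the recommender and a withheld part that is used for scoring. A minimal base R sketch of that split for one hypothetical user follows; the variable names are illustrative and not part of recommenderlab.

```r
set.seed(1)

# Hypothetical test user who has rated 12 movies (named m1..m12)
user_ratings <- setNames(sample(1:5, 12, replace = TRUE), paste0("m", 1:12))

given <- 5  # number of ratings revealed to the recommender

# Randomly pick the ratings the recommender may see ("known");
# the rest are withheld and used to measure prediction error ("unknown")
known_idx <- sample(seq_along(user_ratings), given)
known     <- user_ratings[known_idx]
unknown   <- user_ratings[-known_idx]

length(known)
length(unknown)
```

The recommender predicts ratings from the known part only, and the predictions are then compared with the unknown part.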
# creating evaluation scheme (5-fold CV; ratings of 3 or above are considered good; 5 ratings of each test user are given to the recommender and the rest are withheld for evaluation)
set.seed(123)
es<- evaluationScheme(movie_matrix, method = "cross", train = 0.9, given = 5, goodRating = 3, k = 5)
# building a recommendation using raw data and Pearson coefficient as a similarity measure to find neighbours
param1 = list(normalize = NULL, method = "Pearson")
result_1<- evaluate(es, method = "UBCF", param = param1, type = "ratings")
## UBCF run fold/sample [model time/prediction time]
## 1 [0.008sec/0.88sec]
## 2 [0.001sec/0.847sec]
## 3 [0sec/0.686sec]
## 4 [0sec/0.655sec]
## 5 [0.001sec/0.83sec]
avg(result_1)
## RMSE MSE MAE
## res 2.620887 6.871261 2.346896
# building a recommendation using normalized data (centering) and Pearson coefficient as a similarity measure to find neighbours
param2 = list(normalize = "center", method = "Pearson")
result_2<-evaluate(es, method = "UBCF", param = param2, type = "ratings")
## UBCF run fold/sample [model time/prediction time]
## 1 [0.006sec/0.753sec]
## 2 [0.007sec/0.634sec]
## 3 [0.007sec/0.61sec]
## 4 [0.006sec/0.768sec]
## 5 [0.006sec/0.663sec]
avg(result_2)
## RMSE MSE MAE
## res 1.103812 1.218589 0.8736964
# building a recommendation using normalized data (Z-score) and Pearson coefficient as a similarity measure to find neighbours
param3 = list(normalize = "Z-score", method = "Pearson")
result_3<-evaluate(es, method = "UBCF", param = param3, type = "ratings")
## UBCF run fold/sample [model time/prediction time]
## 1 [0.035sec/0.637sec]
## 2 [0.039sec/0.627sec]
## 3 [0.037sec/0.843sec]
## 4 [0.19sec/0.642sec]
## 5 [0.035sec/0.657sec]
avg(result_3)
## RMSE MSE MAE
## res 1.103173 1.217176 0.8740004
# Cosine similarity
# building a recommendation using Cosine similarity to find neighbours; normalize is not set here, so the UBCF default (centering) applies rather than raw data
param4 = list(method = "Cosine")
result_4<- evaluate(es, method = "UBCF", param = param4, type = "ratings")
## UBCF run fold/sample [model time/prediction time]
## 1 [0.007sec/0.563sec]
## 2 [0.007sec/0.564sec]
## 3 [0.006sec/0.596sec]
## 4 [0.008sec/0.721sec]
## 5 [0.007sec/0.645sec]
avg(result_4)
## RMSE MSE MAE
## res 1.122067 1.259213 0.8906996
# building a recommendation using normalized data (centering) and Cosine similarity to find neighbours
param5 = list(normalize = "center", method = "Cosine")
result_5<-evaluate(es, method = "UBCF", param = param5, type = "ratings")
## UBCF run fold/sample [model time/prediction time]
## 1 [0.007sec/0.773sec]
## 2 [0.006sec/0.775sec]
## 3 [0.018sec/0.644sec]
## 4 [0.006sec/0.667sec]
## 5 [0.006sec/0.552sec]
avg(result_5)
## RMSE MSE MAE
## res 1.122067 1.259213 0.8906996
# building a recommendation using normalized data (Z-score) and Cosine similarity to find neighbours
param6 = list(normalize = "Z-score", method = "Cosine")
result_6<-evaluate(es, method = "UBCF", param = param6, type = "ratings")
## UBCF run fold/sample [model time/prediction time]
## 1 [0.035sec/0.689sec]
## 2 [0.035sec/0.575sec]
## 3 [0.038sec/0.549sec]
## 4 [0.035sec/0.681sec]
## 5 [0.034sec/0.567sec]
avg(result_6)
## RMSE MSE MAE
## res 1.121713 1.258406 0.8910533
The models’ performance is summarized below:
# collecting the average RMSE, MSE and MAE of each model (one row per model)
summary = rbind(avg(result_1), avg(result_2), avg(result_3), avg(result_4), avg(result_5), avg(result_6))
rownames(summary) <- c("model_1","model_2", "model_3", "model_4", "model_5", "model_6")
summary
## RMSE MSE MAE
## model_1 2.620887 6.871261 2.3468962
## model_2 1.103812 1.218589 0.8736964
## model_3 1.103173 1.217176 0.8740004
## model_4 1.122067 1.259213 0.8906996
## model_5 1.122067 1.259213 0.8906996
## model_6 1.121713 1.258406 0.8910533
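For reference, the three error measures in the table compare predicted ratings with the withheld true ratings. A minimal base R sketch on made-up numbers (the vectors are hypothetical and unrelated to the results above):

```r
# Hypothetical true and predicted ratings for five withheld user-movie pairs
actual    <- c(4, 3, 5, 2, 4)
predicted <- c(3.5, 3.0, 4.0, 3.0, 4.5)

err  <- predicted - actual
mse  <- mean(err^2)      # mean squared error
rmse <- sqrt(mse)        # root mean squared error, in rating units
mae  <- mean(abs(err))   # mean absolute error

c(RMSE = rmse, MSE = mse, MAE = mae)
```

RMSE penalizes large errors more strongly than MAE, which is why the two measures can rank closely matched models differently.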
The best-performing model in terms of lowest RMSE is model 3, which uses the Pearson similarity measure and Z-score normalized data, although its margin over model 2 (centering) is very small. Let’s look at the confusion matrix and ROC curve for that model for 5, 10 and 15 recommendations.
param3 = list(normalize = "Z-score", method = "Pearson")
result_3<-evaluate(es, method = "UBCF", param = param3, type = "topNList", n = c(5,10,15))
## UBCF run fold/sample [model time/prediction time]
## 1 [0.036sec/0.701sec]
## 2 [0.042sec/0.85sec]
## 3 [0.042sec/0.814sec]
## 4 [0.038sec/0.817sec]
## 5 [0.037sec/0.692sec]
avg(result_3)
## TP FP FN TN precision recall TPR
## 5 2.183246 2.712042 80.42723 1573.677 0.4459976 0.03836427 0.03836427
## 10 3.893194 5.897382 78.71728 1570.492 0.3977301 0.06401574 0.06401574
## 15 5.364398 9.321466 77.24607 1567.068 0.3653421 0.08320607 0.08320607
## FPR
## 5 0.001690402
## 10 0.003679641
## 15 0.005820184
plot(result_3, annotate = TRUE, main = "ROC curve (model 3)")
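The columns of the confusion matrix can be illustrated for a single hypothetical test user in base R. The item names below are made up. The table above reports values averaged over test users and folds, so its precision and recall need not equal the ratios of its averaged counts.

```r
# Hypothetical example for one test user
all_items   <- paste0("m", 1:20)
good_items  <- c("m2", "m5", "m7", "m11")        # withheld items with a good rating
recommended <- c("m2", "m3", "m7", "m9", "m15")  # top-5 recommendation list

TP <- length(intersect(recommended, good_items))  # recommended and good
FP <- length(setdiff(recommended, good_items))    # recommended but not good
FN <- length(setdiff(good_items, recommended))    # good but not recommended
TN <- length(all_items) - TP - FP - FN            # neither

precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)  # identical to the true positive rate (TPR)
FPR       <- FP / (FP + TN)

c(precision = precision, recall = recall, FPR = FPR)
```

The ROC curve plots TPR against FPR as the number of recommendations grows.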
Item-based collaborative filtering assumes that users will prefer products similar to the ones they have already rated.
This method explores the relationships between items.
For each item, only the top n most similar items are stored (rather than all items, for efficiency), based on a similarity measure (Cosine or Pearson). A weighted sum is then used to make the recommendation for a user.
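The idea can be sketched in base R on a toy matrix. This is a simplified illustration (hypothetical data, zeros treated as missing ratings, no normalization) and not the recommenderlab implementation.

```r
# Toy user-item rating matrix (hypothetical data; 0 = not rated)
ratings <- matrix(
  c(5, 4, 0, 1,
    4, 5, 1, 0,
    1, 0, 5, 3,
    0, 2, 4, 5),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("u", 1:4), paste0("m", 1:4))
)

# Item-item cosine similarity
norms <- sqrt(colSums(ratings^2))
sim   <- crossprod(ratings) / outer(norms, norms)
diag(sim) <- 0  # an item is not its own neighbour

# For each item, store only its k most similar items
k <- 2
for (j in seq_len(ncol(sim))) {
  weakest <- order(sim[, j], decreasing = TRUE)[-(1:k)]
  sim[weakest, j] <- 0
}

# Predict u1's rating for the unrated movie m3 as a similarity-weighted
# average of u1's ratings for the stored neighbours of m3
user <- "u1"
item <- "m3"
w <- sim[, item] * (ratings[user, ] > 0)
prediction <- sum(w * ratings[user, ]) / sum(w)
prediction
```

Because only k neighbours per item are kept, the stored similarity model is much smaller than the full item-item matrix.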
Now I am going to build item-based models employing two types of normalization techniques and two types of similarity measures: the Pearson coefficient and cosine similarity.
A cross-validation scheme will be used to evaluate the models’ performance.
# building a recommendation using raw data and Pearson coefficient as a similarity measure to find neighbours
param7 = list(normalize = NULL, method = "Pearson")
result_7<-evaluate(es, method = "IBCF", param = param7, type = "ratings")
## IBCF run fold/sample [model time/prediction time]
## 1 [23.572sec/0.203sec]
## 2 [15.337sec/0.04sec]
## 3 [18.167sec/0.061sec]
## 4 [17.445sec/0.055sec]
## 5 [13.928sec/0.048sec]
avg(result_7)
## RMSE MSE MAE
## res 1.523747 2.326853 1.135707
# building a recommendation using normalized data (centering) and Pearson coefficient as a similarity measure to find neighbours
param8 = list(normalize = "center", method = "Pearson")
result_8<-evaluate(es, method = "IBCF", param = param8, type = "ratings")
## IBCF run fold/sample [model time/prediction time]
## 1 [13.748sec/0.036sec]
## 2 [14.094sec/0.055sec]
## 3 [13.914sec/0.039sec]
## 4 [14.81sec/0.038sec]
## 5 [14.353sec/0.05sec]
avg(result_8)
## RMSE MSE MAE
## res 1.467439 2.165082 1.07166
# building a recommendation using normalized data (Z-score) and Pearson coefficient as a similarity measure to find neighbours
param9 = list(normalize = "Z-score", method = "Pearson")
result_9<-evaluate(es, method = "IBCF", param = param9, type = "ratings")
## IBCF run fold/sample [model time/prediction time]
## 1 [13.696sec/0.042sec]
## 2 [15.121sec/0.063sec]
## 3 [13.876sec/0.072sec]
## 4 [15.392sec/0.066sec]
## 5 [14.332sec/0.048sec]
avg(result_9)
## RMSE MSE MAE
## res 1.50876 2.29199 1.110934
# Cosine similarity
# building a recommendation using Cosine similarity to find neighbours; normalize is not set here, so the IBCF default (centering) applies rather than raw data
param10 = list(method = "Cosine")
result_10<- evaluate(es, method = "IBCF", param = param10, type = "ratings")
## IBCF run fold/sample [model time/prediction time]
## 1 [14.761sec/0.036sec]
## 2 [14.585sec/0.034sec]
## 3 [14.292sec/0.033sec]
## 4 [14.458sec/0.052sec]
## 5 [14.429sec/0.05sec]
avg(result_10)
## RMSE MSE MAE
## res 1.434383 2.101411 1.028052
# building a recommendation using normalized data (centering) and Cosine similarity as a similarity measure to find neighbours
param11 = list(normalize = "center", method = "Cosine")
result_11<-evaluate(es, method = "IBCF", param = param11, type = "ratings")
## IBCF run fold/sample [model time/prediction time]
## 1 [14.463sec/0.05sec]
## 2 [14.45sec/0.05sec]
## 3 [14.523sec/0.046sec]
## 4 [14.429sec/0.05sec]
## 5 [14.564sec/0.049sec]
avg(result_11)
## RMSE MSE MAE
## res 1.434383 2.101411 1.028052
# building a recommendation using normalized data (Z-score) and Cosine similarity as a similarity measure to find neighbours
param12 = list(normalize = "Z-score", method = "Cosine")
result_12<-evaluate(es, method = "IBCF", param = param12, type = "ratings")
## IBCF run fold/sample [model time/prediction time]
## 1 [14.587sec/0.059sec]
## 2 [14.209sec/0.057sec]
## 3 [15.315sec/0.065sec]
## 4 [15.483sec/0.055sec]
## 5 [15.016sec/0.055sec]
avg(result_12)
## RMSE MSE MAE
## res 1.471373 2.215995 1.055984
The models’ performance is summarized below:
# collecting the average RMSE, MSE and MAE of each model (one row per model)
summary2 = rbind(avg(result_7), avg(result_8), avg(result_9), avg(result_10), avg(result_11), avg(result_12))
rownames(summary2) <- c("model_7","model_8", "model_9", "model_10", "model_11", "model_12")
summary2
## RMSE MSE MAE
## model_7 1.523747 2.326853 1.135707
## model_8 1.467439 2.165082 1.071660
## model_9 1.508760 2.291990 1.110934
## model_10 1.434383 2.101411 1.028052
## model_11 1.434383 2.101411 1.028052
## model_12 1.471373 2.215995 1.055984
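The best models are model 10 and model 11, which were built using the Cosine similarity measure. Their results are identical because model 10 does not set normalize and therefore falls back to the default centering, which model 11 requests explicitly.
Let’s look at the confusion matrix and ROC curve of model 10 for 5, 10 and 15 recommendations.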
param10 = list(method = "Cosine")
result_10<- evaluate(es, method = "IBCF", param = param10, type = "topNList", n = c(5,10,15))
## IBCF run fold/sample [model time/prediction time]
## 1 [14.507sec/0.061sec]
## 2 [14.683sec/0.064sec]
## 3 [14.635sec/0.071sec]
## 4 [14.955sec/0.065sec]
## 5 [15.055sec/0.048sec]
avg(result_10)
## TP FP FN TN precision recall
## 5 0.07643979 4.765445 82.53403 1571.624 0.01578546 0.0007658662
## 10 0.13507853 9.153927 82.47539 1567.236 0.01432130 0.0014199840
## 15 0.17905759 12.734031 82.43141 1563.655 0.01332812 0.0018856694
## TPR FPR
## 5 0.0007658662 0.003028092
## 10 0.0014199840 0.005817471
## 15 0.0018856694 0.008093226
plot(result_10, annotate = TRUE, main = "ROC curve (model 10)")
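As we see, model_10 performs clearly worse than the best model of the user-based approach (model_3): its RMSE is 1.43 compared with 1.10, and its precision for 5 recommendations is about 0.02 compared with 0.45.
In general, the normalized user-based models achieved lower errors than the item-based models (RMSE of about 1.10 to 1.12 compared with 1.43 to 1.52), but the user-based approach requires more memory because it keeps the whole rating matrix.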
Let’s build the complete model (user-based recommendation model using Pearson coefficient) and make recommendations.
# splitting data into train and test sets (for each test user, 5 ratings are "known" and the rest are "unknown")
esf<- evaluationScheme(movie_matrix, method = "split", train = 0.9, given = 5, goodRating = 3)
train <-getData(esf, "train")
test <-getData(esf, "unknown")
test_known <- getData(esf, "known")
# building user-based recommendation model
param_f<- list (method = "Pearson", nn=10)
final_model <- Recommender(train, method = "UBCF", param = param_f)
final_model
## Recommender of type 'UBCF' for 'realRatingMatrix'
## learned using 848 users.
# getting recommendations (top 10); @items holds the column indices of the recommended movies and @ratings their predicted ratings
final_prediction<- predict (final_model, test, n = 10, type = "topNList")
final_prediction@items[1]
## $`14`
## [1] 285 338 303 299 676 270 305 309 342 890
final_prediction@ratings[1]
## $`14`
## [1] 4.844350 4.683689 4.427757 4.384879 4.377291 4.359480 4.345418
## [8] 4.320462 4.320462 4.320462