Project

Introduction

The purpose of the project is to implement recommendation algorithms for an existing dataset of user-item ratings. The data for the project is taken from MovieLens and can be downloaded from https://grouplens.org/datasets/movielens/. The small version of the full dataset contains about 610 users and roughly 10000 movies; the MovieLense sample used in most of the code below contains 943 users and 1664 movies, rated by users on a scale from 1 to 5. User-based and item-based collaborative filtering will be employed in this project.

The “recommenderlab” package will be used as the core package for building the recommender algorithms.

Data Exploration

Reading the data and making the necessary transformations. We can use the MovieLense data set that ships with recommenderlab (quicker for testing code) or the full version from https://grouplens.org/datasets/movielens/.

# loading the package; recommenderlab contains a sample of the MovieLense data set
library(recommenderlab)

data(MovieLense)
movie_matrix <- MovieLense
dim(movie_matrix)

## [1]  943 1664

#  alternative way to load MovieLense dataset from the link https://grouplens.org/datasets/movielens/

# reading data
# ratings = read.csv("/Users/Olga/Desktop/ml-latest-small/ratings.csv")

# transforming to a wide format (requires the dplyr and tidyr packages)
# data<-ratings%>% select (movieId, userId, rating) %>% spread (movieId,rating)

#  converting the data set into a real rating matrix
# movie_matrix <- as(as.matrix(data[-c(1)]), "realRatingMatrix")
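
The long-to-wide step described in the comments above can be sketched in base R on a few made-up rows. The data frame below is hypothetical (only the column names are taken from `ratings.csv`), and `tapply` is used here instead of `spread`, so this is an illustration of the reshaping idea rather than the exact code of the project.

```r
# Hypothetical long-format ratings with the same column names as ratings.csv
ratings <- data.frame(userId  = c(1, 1, 2, 3, 3),
                      movieId = c(10, 20, 10, 20, 30),
                      rating  = c(4, 5, 3, 2, 4.5))

# One row per user, one column per movie; user-movie pairs without a rating become NA
wide <- tapply(ratings$rating, list(ratings$userId, ratings$movieId), mean)

wide
dim(wide)
```

A numeric matrix of this shape, with NA for unrated movies, is the kind of input that the commented `as(..., "realRatingMatrix")` line converts.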

#  looking at the matrix structure and the first 5 rows of the real rating matrix
str(movie_matrix)

## Formal class 'realRatingMatrix' [package "recommenderlab"] with 2 slots
##   ..@ data     :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. ..@ i       : int [1:99392] 0 1 4 5 9 12 14 15 16 17 ...
##   .. .. ..@ p       : int [1:1665] 0 452 583 673 882 968 994 1386 1605 1904 ...
##   .. .. ..@ Dim     : int [1:2] 943 1664
##   .. .. ..@ Dimnames:List of 2
##   .. .. .. ..$ : chr [1:943] "1" "2" "3" "4" ...
##   .. .. .. ..$ : chr [1:1664] "Toy Story (1995)" "GoldenEye (1995)" "Four Rooms (1995)" "Get Shorty (1995)" ...
##   .. .. ..@ x       : num [1:99392] 5 4 4 4 4 3 1 5 4 5 ...
##   .. .. ..@ factors : list()
##   ..@ normalize: NULL

head(movie_matrix@data [1:5,1:5])

## 5 x 5 sparse Matrix of class "dgCMatrix"
##   Toy Story (1995) GoldenEye (1995) Four Rooms (1995) Get Shorty (1995)
## 1                5                3                 4                 3
## 2                4                .                 .                 .
## 3                .                .                 .                 .
## 4                .                .                 .                 .
## 5                4                3                 .                 .
##   Copycat (1995)
## 1              3
## 2              .
## 3              .
## 4              .
## 5              .

#  looking at the ratings provided by users 1 and 100 for the first 5 movies
movie_matrix@data[1,1:5]

##  Toy Story (1995)  GoldenEye (1995) Four Rooms (1995) Get Shorty (1995) 
##                 5                 3                 4                 3 
##    Copycat (1995) 
##                 3

movie_matrix@data[100, 1:5]

##  Toy Story (1995)  GoldenEye (1995) Four Rooms (1995) Get Shorty (1995) 
##                 0                 0                 0                 0 
##    Copycat (1995) 
##                 0

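The matrix contains 943 users (rows) and 1664 movies (columns), with 99392 ratings in total. Only about 6.3% of the user-movie cells are filled, so the matrix is sparse. The zeros shown for user 100 above mean “not rated”, not a rating of 0.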
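
The degree of sparsity follows directly from the dimensions and the rating count reported above. A small base R calculation (the variable names are illustrative):

```r
# Dimensions and rating count as reported for the MovieLense sample above
n_users   <- 943
n_movies  <- 1664
n_ratings <- 99392

n_cells  <- n_users * n_movies    # total number of user-movie pairs
density  <- n_ratings / n_cells   # share of pairs that carry a rating
sparsity <- 1 - density           # share of pairs without a rating

round(100 * density, 2)    # percentage of filled cells
round(100 * sparsity, 2)   # percentage of empty cells
```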

# looking at the number of movies the 100th user has rated
length(movie_matrix@data[100,][movie_matrix@data[100,] > 0])

## [1] 56

#  checking total number of ratings given by the users
nratings(movie_matrix)

## [1] 99392

#  overall rating distribution
hist(getRatings(movie_matrix), main = "Distribution Of Ratings", xlim=c(0,5), breaks="FD")

The most popular rating is 4.

#  finding the most/least popular movies
ratings_binary<-binarize(movie_matrix, minRating = 1)
ratings_binary

## 943 x 1664 rating matrix of class 'binaryRatingMatrix' with 99392 ratings.

ratings_sum<-colSums(ratings_binary)
ratings_sum_df<- data.frame(movie = names(ratings_sum), pratings = ratings_sum)
head(ratings_sum_df[order(-ratings_sum_df$pratings), ],10)

##                                                       movie pratings
## Star Wars (1977)                           Star Wars (1977)      583
## Contact (1997)                               Contact (1997)      509
## Fargo (1996)                                   Fargo (1996)      508
## Return of the Jedi (1983)         Return of the Jedi (1983)      507
## Liar Liar (1997)                           Liar Liar (1997)      485
## English Patient, The (1996)     English Patient, The (1996)      481
## Scream (1996)                                 Scream (1996)      478
## Toy Story (1995)                           Toy Story (1995)      452
## Air Force One (1997)                   Air Force One (1997)      431
## Independence Day (ID4) (1996) Independence Day (ID4) (1996)      429

tail(ratings_sum_df[order(-ratings_sum_df$pratings), ],10)

##                                                                               movie
## Further Gesture, A (1996)                                 Further Gesture, A (1996)
## Mirage (1995)                                                         Mirage (1995)
## Mamma Roma (1962)                                                 Mamma Roma (1962)
## Sunchaser, The (1996)                                         Sunchaser, The (1996)
## War at Home, The (1996)                                     War at Home, The (1996)
## Sweet Nothing (1995)                                           Sweet Nothing (1995)
## Mat' i syn (1997)                                                 Mat' i syn (1997)
## B. Monkey (1998)                                                   B. Monkey (1998)
## You So Crazy (1994)                                             You So Crazy (1994)
## Scream of Stone (Schrei aus Stein) (1991) Scream of Stone (Schrei aus Stein) (1991)
##                                           pratings
## Further Gesture, A (1996)                        1
## Mirage (1995)                                    1
## Mamma Roma (1962)                                1
## Sunchaser, The (1996)                            1
## War at Home, The (1996)                          1
## Sweet Nothing (1995)                             1
## Mat' i syn (1997)                                1
## B. Monkey (1998)                                 1
## You So Crazy (1994)                              1
## Scream of Stone (Schrei aus Stein) (1991)        1

Normalization

Two normalization techniques are going to be implemented:

Centering

removes the row (user) bias by subtracting the row mean from every value in the row
makes the mean of each row equal to 0
does not change the scale of the variable
used when all variables in a data set are measured on the same scale

Z-score

obtained by subtracting the row mean from each individual score and dividing the result by the row standard deviation
centers the data AND changes the scale (each row gets mean 0 and standard deviation 1)
used when variables are measured on different scales
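
Both techniques can be illustrated on a toy matrix in base R. This is a minimal sketch, assuming row-wise (per-user) normalization and the sample standard deviation; the matrix values are made up, and the exact internals of `normalize()` in recommenderlab may differ in detail.

```r
# Toy user x item rating matrix; NA marks an unrated item (values are illustrative)
ratings <- matrix(c(5, 3, NA, 1,
                    4, NA, 4, 2,
                    NA, 2, 3, 5),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(paste0("u", 1:3), paste0("m", 1:4)))

row_mean <- rowMeans(ratings, na.rm = TRUE)
row_sd   <- apply(ratings, 1, sd, na.rm = TRUE)

centered <- sweep(ratings, 1, row_mean, "-")   # centering: subtract each user's mean
zscored  <- sweep(centered, 1, row_sd, "/")    # Z-score: also divide by the user's sd

round(rowMeans(centered, na.rm = TRUE), 10)      # each user's mean is now 0
round(apply(zscored, 1, sd, na.rm = TRUE), 10)   # each user's sd is now 1
```

Unrated cells stay NA in both results, so normalization does not invent ratings.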

data.norm.c<-normalize(movie_matrix, method="center")
data.norm.z<-normalize(movie_matrix, method="Z-score")

#  plotting the rating distribution for raw, centered and Z-score normalized ratings
par(mfrow = c(3,1))
plot(density(getRatings(movie_matrix)),main = 'Raw')
plot(density(getRatings(data.norm.c)),main = 'Normalized')
plot(density(getRatings(data.norm.z)),main = 'Z-score')

par(mfrow = c(1,1))

From the plots we can see that both normalization techniques bring the data closer to a normal distribution than its original distribution.

User-based collaborative filtering

Similar users will have similar movie tastes
It is a memory-based model, as it loads the whole rating matrix into memory

User-based collaborative filtering is a two-step process. The first step finds the neighbours of a given user (using a similarity measure such as the Pearson coefficient or Cosine similarity). The second step predicts the rating of an item the user has not rated as the average rating that the user’s neighbours gave to that item.
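
The two steps can be sketched on a toy matrix in base R. The matrix, the target user, and the neighbourhood size `nn` are illustrative assumptions, and the plain average used here is a simplification of what the UBCF implementation in recommenderlab does.

```r
# Toy user x item rating matrix; u1 has not rated m4 (values are illustrative)
ratings <- matrix(c(5, 4, 1, NA,
                    5, 5, 2, 4,
                    1, 2, 5, 2,
                    4, 5, 1, 5),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(paste0("u", 1:4), paste0("m", 1:4)))

target <- "u1"   # user to predict for
item   <- "m4"   # item the target user has not rated
nn     <- 2      # number of neighbours to use

# Step 1: Pearson similarity between the target user and every other user,
# computed over the items both users have rated
others <- setdiff(rownames(ratings), target)
sims <- sapply(others, function(u) {
  cor(ratings[target, ], ratings[u, ], use = "pairwise.complete.obs")
})

# Step 2: keep the nn most similar users who rated the item and average their ratings
candidates <- others[!is.na(ratings[others, item])]
neighbours <- names(sort(sims[candidates], decreasing = TRUE))[seq_len(nn)]
prediction <- mean(ratings[neighbours, item])

round(sims, 3)
neighbours
prediction
```

Here u3 has tastes opposite to u1 and is left out of the neighbourhood, so the prediction is the mean of the ratings that u2 and u4 gave to m4.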

Now I am going to build user-based models employing two types of normalization techniques and two types of similarity measures: the Pearson coefficient and Cosine similarity.

Cross-validation scheme will be used to evaluate the models’ performance.
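
The idea behind a k-fold scheme with a fixed number of given ratings per test user can be sketched in base R as follows. The toy data, the seed, and all variable names are assumptions made for this illustration; it mimics the protocol only conceptually and is not the internal code of `evaluationScheme()`.

```r
set.seed(123)   # seed chosen only to make the sketch repeatable
n_users <- 20; n_items <- 12; k <- 5; given <- 5

# Toy ratings: every user rates 8 random items on a 1-5 scale, the rest stay NA
ratings <- matrix(NA_real_, n_users, n_items)
for (u in seq_len(n_users)) {
  rated <- sample(n_items, 8)
  ratings[u, rated] <- sample(1:5, 8, replace = TRUE)
}

# Assign each user to exactly one of the k folds
fold <- sample(rep(seq_len(k), length.out = n_users))

# Fold 1 as the test fold: users outside it would train the model; for each
# test user, `given` ratings stay visible and the remaining ones are withheld
test_users <- which(fold == 1)
known   <- ratings[test_users, , drop = FALSE]
unknown <- ratings[test_users, , drop = FALSE]
for (i in seq_along(test_users)) {
  rated <- which(!is.na(known[i, ]))
  keep  <- sample(rated, given)
  known[i, setdiff(rated, keep)] <- NA   # hide everything except the given items
  unknown[i, keep] <- NA                 # the withheld part excludes the given items
}

rowSums(!is.na(known))     # 5 visible ratings per test user
rowSums(!is.na(unknown))   # 3 withheld ratings per test user
```

The model sees only `known` for the test users, and its predictions are compared against `unknown`; repeating this with each fold as the test fold and averaging gives the cross-validated error.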

#  creating evaluation scheme: 5-fold cross-validation (k = 5); for each test user 5 ratings
#  are given to the recommender (given = 5) and the rest are withheld for evaluation;
#  ratings of 3 and above are considered good ratings (goodRating = 3)
set.seed(123)
es<- evaluationScheme(movie_matrix, method = "cross", train = 0.9, given = 5, goodRating = 3, k = 5)

#  building a recommendation using raw data and Pearson coefficient as a similarity measure to find neighbours
param1 = list(normalize = NULL, method = "Pearson")
result_1<- evaluate(es, method = "UBCF",  param = param1, type = "ratings")

## UBCF run fold/sample [model time/prediction time]
##   1  [0.008sec/0.88sec] 
##   2  [0.001sec/0.847sec] 
##   3  [0sec/0.686sec] 
##   4  [0sec/0.655sec] 
##   5  [0.001sec/0.83sec]

avg(result_1)

##         RMSE      MSE      MAE
## res 2.620887 6.871261 2.346896

#  building a recommendation using normalized data (centering) and Pearson coefficient as a similarity measure to find neighbours
param2 = list(normalize = "center", method = "Pearson")
result_2<-evaluate(es, method = "UBCF", param = param2, type = "ratings")

## UBCF run fold/sample [model time/prediction time]
##   1  [0.006sec/0.753sec] 
##   2  [0.007sec/0.634sec] 
##   3  [0.007sec/0.61sec] 
##   4  [0.006sec/0.768sec] 
##   5  [0.006sec/0.663sec]

avg(result_2)

##         RMSE      MSE       MAE
## res 1.103812 1.218589 0.8736964

#  building a recommendation using normalized data (Z-score) and Pearson coefficient as a similarity measure to find neighbours
param3 = list(normalize = "Z-score", method = "Pearson")
result_3<-evaluate(es, method = "UBCF", param = param3, type = "ratings")

## UBCF run fold/sample [model time/prediction time]
##   1  [0.035sec/0.637sec] 
##   2  [0.039sec/0.627sec] 
##   3  [0.037sec/0.843sec] 
##   4  [0.19sec/0.642sec] 
##   5  [0.035sec/0.657sec]

avg(result_3)

##         RMSE      MSE       MAE
## res 1.103173 1.217176 0.8740004

# Cosine similarity

#  building a recommendation using Cosine similarity to find neighbours; `normalize` is not set here,
#  so the package default applies (centering for UBCF), which is why the results match model 5 below
param4 = list(method = "Cosine")
result_4<- evaluate(es, method = "UBCF", param = param4, type = "ratings")

## UBCF run fold/sample [model time/prediction time]
##   1  [0.007sec/0.563sec] 
##   2  [0.007sec/0.564sec] 
##   3  [0.006sec/0.596sec] 
##   4  [0.008sec/0.721sec] 
##   5  [0.007sec/0.645sec]

avg(result_4)

##         RMSE      MSE       MAE
## res 1.122067 1.259213 0.8906996

#  building a recommendation using normalized data (centering) and Cosine distance as a similarity measure to find neighbours
param5 = list(normalize = "center", method = "Cosine")
result_5<-evaluate(es, method = "UBCF", param = param5, type = "ratings")

## UBCF run fold/sample [model time/prediction time]
##   1  [0.007sec/0.773sec] 
##   2  [0.006sec/0.775sec] 
##   3  [0.018sec/0.644sec] 
##   4  [0.006sec/0.667sec] 
##   5  [0.006sec/0.552sec]

avg(result_5)

##         RMSE      MSE       MAE
## res 1.122067 1.259213 0.8906996

#  building a recommendation using normalized data (Z-score) and Cosine distance as a similarity measure to find neighbours
param6 = list(normalize = "Z-score", method = "Cosine")
result_6<-evaluate(es, method = "UBCF", param = param6, type = "ratings")

## UBCF run fold/sample [model time/prediction time]
##   1  [0.035sec/0.689sec] 
##   2  [0.035sec/0.575sec] 
##   3  [0.038sec/0.549sec] 
##   4  [0.035sec/0.681sec] 
##   5  [0.034sec/0.567sec]

avg(result_6)

##         RMSE      MSE       MAE
## res 1.121713 1.258406 0.8910533

The models’ performance is summarized below:

#  stacking the fold-averaged error metrics of the six user-based models into one table
summary <- rbind(avg(result_1), avg(result_2), avg(result_3),
                 avg(result_4), avg(result_5), avg(result_6))
rownames(summary) <- paste0("model_", 1:6)
summary

##             RMSE      MSE       MAE
## model_1 2.620887 6.871261 2.3468962
## model_2 1.103812 1.218589 0.8736964
## model_3 1.103173 1.217176 0.8740004
## model_4 1.122067 1.259213 0.8906996
## model_5 1.122067 1.259213 0.8906996
## model_6 1.121713 1.258406 0.8910533
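
The three error columns are standard measures of the gap between predicted and withheld ratings. A minimal base R sketch with made-up numbers (both vectors are hypothetical):

```r
actual    <- c(4, 3, 5, 2, 1)     # hypothetical withheld ratings
predicted <- c(3.5, 3, 4, 3, 2)   # hypothetical model predictions

err  <- predicted - actual
mae  <- mean(abs(err))   # mean absolute error
mse  <- mean(err^2)      # mean squared error
rmse <- sqrt(mse)        # root mean squared error

c(RMSE = rmse, MSE = mse, MAE = mae)
```

Within a single run RMSE is exactly the square root of MSE. The values in the table are averages over the five folds, so the averaged MSE need not equal the square of the averaged RMSE.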

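The best-performing model in terms of lowest RMSE is model 3, which uses the Pearson similarity measure and Z-score normalized data, although model 2 (centering) is practically tied (RMSE 1.1038 against 1.1032) and has a marginally lower MAE. Let’s look at the confusion matrix and ROC curve for model 3 for 5, 10, and 15 recommendations.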

param3 = list(normalize = "Z-score", method = "Pearson")
result_3<-evaluate(es, method = "UBCF", param = param3, type = "topNList", n = c(5,10,15))

## UBCF run fold/sample [model time/prediction time]
##   1  [0.036sec/0.701sec] 
##   2  [0.042sec/0.85sec] 
##   3  [0.042sec/0.814sec] 
##   4  [0.038sec/0.817sec] 
##   5  [0.037sec/0.692sec]

avg(result_3)

##          TP       FP       FN       TN precision     recall        TPR
## 5  2.183246 2.712042 80.42723 1573.677 0.4459976 0.03836427 0.03836427
## 10 3.893194 5.897382 78.71728 1570.492 0.3977301 0.06401574 0.06401574
## 15 5.364398 9.321466 77.24607 1567.068 0.3653421 0.08320607 0.08320607
##            FPR
## 5  0.001690402
## 10 0.003679641
## 15 0.005820184

plot(result_3, annotate = TRUE, main = "ROC curve (model 3)")
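
The columns of the confusion matrix can be illustrated for a single user in base R. The recommended list, the set of relevant items, and the number of candidate items are all hypothetical; the table above reports averages over users and folds, so its ratio columns will not necessarily follow from its averaged counts.

```r
recommended  <- c("m1", "m4", "m7", "m9", "m12")   # hypothetical top-5 list
relevant     <- c("m1", "m2", "m9", "m15")         # hypothetical withheld items with a good rating
n_candidates <- 20                                   # assumed number of items eligible for recommendation

TP <- length(intersect(recommended, relevant))   # recommended and relevant
FP <- length(setdiff(recommended, relevant))     # recommended but not relevant
FN <- length(setdiff(relevant, recommended))     # relevant but not recommended
TN <- n_candidates - TP - FP - FN                # neither recommended nor relevant

precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)   # the same quantity as TPR
FPR       <- FP / (FP + TN)

c(TP = TP, FP = FP, FN = FN, TN = TN)
c(precision = precision, recall = recall, FPR = FPR)
```

The ROC curve plots TPR against FPR for each list length, so longer lists move the point up and to the right.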

Item-based (model-based) collaborative filtering

users will prefer products similar to the ones they have already rated
this method explores the relationship between items
for each item, only the top n most similar items are stored (rather than all items, for efficiency), based on a similarity measure (Cosine or Pearson). A weighted sum of the user’s ratings is then used to make the recommendation.
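
The model-building and prediction steps can be sketched on a toy matrix in base R. The matrix, the neighbourhood size `k`, and the helper `cosine()` are assumptions made for this illustration; the IBCF implementation in recommenderlab has more options and may differ in detail.

```r
# Toy user x item rating matrix; u1 has not rated m3 (values are illustrative)
ratings <- matrix(c(5, 4, NA, 1,
                    4, 5, 2, 1,
                    1, 2, 5, 4,
                    2, 1, 4, 5),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(paste0("u", 1:4), paste0("m", 1:4)))

# Cosine similarity between two item columns over the users who rated both
cosine <- function(a, b) {
  ok <- !is.na(a) & !is.na(b)
  sum(a[ok] * b[ok]) / sqrt(sum(a[ok]^2) * sum(b[ok]^2))
}

items <- colnames(ratings)
sim <- sapply(items, function(j) {
  sapply(items, function(i) cosine(ratings[, i], ratings[, j]))
})
diag(sim) <- 0   # an item is not its own neighbour

# Model step: for each item keep only the k most similar items
k <- 2
model <- sim
for (i in items) {
  discard <- names(sort(sim[i, ], decreasing = TRUE))[-seq_len(k)]
  model[i, discard] <- 0
}

# Prediction step: weighted sum of the user's own ratings on the neighbours of the item
user <- "u1"; item <- "m3"
rated <- !is.na(ratings[user, ])
w <- model[item, rated]
prediction <- sum(w * ratings[user, rated]) / sum(w)

round(model, 3)
prediction
```

Only the reduced `model` matrix is needed at prediction time, which is why the item-based approach is called model-based.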

Now I am going to build item-based models employing two types of normalization techniques and two types of similarity measures: the Pearson coefficient and Cosine similarity.

Cross-validation scheme will be used to evaluate the models’ performance.

#  building a recommendation using raw data and Pearson coefficient as a similarity measure to find neighbours
param7 = list(normalize = NULL, method = "Pearson")
result_7<-evaluate(es, method = "IBCF", param = param7, type = "ratings")

## IBCF run fold/sample [model time/prediction time]
##   1  [23.572sec/0.203sec] 
##   2  [15.337sec/0.04sec] 
##   3  [18.167sec/0.061sec] 
##   4  [17.445sec/0.055sec] 
##   5  [13.928sec/0.048sec]

avg(result_7)

##         RMSE      MSE      MAE
## res 1.523747 2.326853 1.135707

#  building a recommendation using normalized data (centering) and Pearson coefficient as a similarity measure to find neighbours
param8 = list(normalize = "center", method = "Pearson")
result_8<-evaluate(es, method = "IBCF", param = param8, type = "ratings")

## IBCF run fold/sample [model time/prediction time]
##   1  [13.748sec/0.036sec] 
##   2  [14.094sec/0.055sec] 
##   3  [13.914sec/0.039sec] 
##   4  [14.81sec/0.038sec] 
##   5  [14.353sec/0.05sec]

avg(result_8)

##         RMSE      MSE     MAE
## res 1.467439 2.165082 1.07166

#  building a recommendation using normalized data (Z-score) and Pearson coefficient as a similarity measure to find neighbours
param9 = list(normalize = "Z-score", method = "Pearson")
result_9<-evaluate(es, method = "IBCF", param = param9, type = "ratings")

## IBCF run fold/sample [model time/prediction time]
##   1  [13.696sec/0.042sec] 
##   2  [15.121sec/0.063sec] 
##   3  [13.876sec/0.072sec] 
##   4  [15.392sec/0.066sec] 
##   5  [14.332sec/0.048sec]

avg(result_9)

##        RMSE     MSE      MAE
## res 1.50876 2.29199 1.110934

# Cosine similarity

#  building a recommendation using Cosine similarity to find neighbours; `normalize` is not set here,
#  so the package default applies (centering for IBCF), which is why the results match model 11 below
param10 = list(method = "Cosine")
result_10<- evaluate(es, method = "IBCF", param = param10, type = "ratings")

## IBCF run fold/sample [model time/prediction time]
##   1  [14.761sec/0.036sec] 
##   2  [14.585sec/0.034sec] 
##   3  [14.292sec/0.033sec] 
##   4  [14.458sec/0.052sec] 
##   5  [14.429sec/0.05sec]

avg(result_10)

##         RMSE      MSE      MAE
## res 1.434383 2.101411 1.028052

#  building a recommendation using normalized data (centering) and Cosine similarity as a similarity measure to find neighbours
param11 = list(normalize = "center", method = "Cosine")
result_11<-evaluate(es, method = "IBCF", param = param11, type = "ratings")

## IBCF run fold/sample [model time/prediction time]
##   1  [14.463sec/0.05sec] 
##   2  [14.45sec/0.05sec] 
##   3  [14.523sec/0.046sec] 
##   4  [14.429sec/0.05sec] 
##   5  [14.564sec/0.049sec]

avg(result_11)

##         RMSE      MSE      MAE
## res 1.434383 2.101411 1.028052

#  building a recommendation using normalized data (Z-score) and Cosine similarity as a similarity measure to find neighbours
param12 = list(normalize = "Z-score", method = "Cosine")
result_12<-evaluate(es, method = "IBCF", param = param12, type = "ratings")

## IBCF run fold/sample [model time/prediction time]
##   1  [14.587sec/0.059sec] 
##   2  [14.209sec/0.057sec] 
##   3  [15.315sec/0.065sec] 
##   4  [15.483sec/0.055sec] 
##   5  [15.016sec/0.055sec]

avg(result_12)

##         RMSE      MSE      MAE
## res 1.471373 2.215995 1.055984

The models’ performance is summarized below:

#  stacking the fold-averaged error metrics of the six item-based models into one table
summary2 <- rbind(avg(result_7), avg(result_8), avg(result_9),
                  avg(result_10), avg(result_11), avg(result_12))
rownames(summary2) <- paste0("model_", 7:12)
summary2

##              RMSE      MSE      MAE
## model_7  1.523747 2.326853 1.135707
## model_8  1.467439 2.165082 1.071660
## model_9  1.508760 2.291990 1.110934
## model_10 1.434383 2.101411 1.028052
## model_11 1.434383 2.101411 1.028052
## model_12 1.471373 2.215995 1.055984

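Models 10 and 11 share the lowest RMSE; both were built using the Cosine similarity measure. Their results are identical because model 10 does not set `normalize`, so the default normalization (centering) applies and the two configurations coincide.

Let’s look at the confusion matrix and ROC curve of model 10 for 5, 10, and 15 recommendations.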

param10 = list(method = "Cosine")
result_10<- evaluate(es, method = "IBCF", param = param10, type = "topNList", n = c(5,10,15))

## IBCF run fold/sample [model time/prediction time]
##   1  [14.507sec/0.061sec] 
##   2  [14.683sec/0.064sec] 
##   3  [14.635sec/0.071sec] 
##   4  [14.955sec/0.065sec] 
##   5  [15.055sec/0.048sec]

avg(result_10)

##            TP        FP       FN       TN  precision       recall
## 5  0.07643979  4.765445 82.53403 1571.624 0.01578546 0.0007658662
## 10 0.13507853  9.153927 82.47539 1567.236 0.01432130 0.0014199840
## 15 0.17905759 12.734031 82.43141 1563.655 0.01332812 0.0018856694
##             TPR         FPR
## 5  0.0007658662 0.003028092
## 10 0.0014199840 0.005817471
## 15 0.0018856694 0.008093226

plot(result_10, annotate = TRUE, main = "ROC curve (model 10)")

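As we see, model_10 performs considerably worse than the best model of the user-based approach (model_3): its RMSE is 1.43 against 1.10, and its precision for 5 recommendations is about 0.02 against 0.45.

In general, the normalized user-based models (RMSE 1.10 to 1.12) performed better than all item-based models (RMSE 1.43 to 1.52); only the user-based model on raw data (model_1, RMSE 2.62) was worse. The user-based approach requires more memory, since the whole rating matrix is kept for prediction, while the item-based approach takes far longer to build the model (about 14 seconds per fold against a few milliseconds) but predicts faster.

Let’s build the complete model (a user-based recommendation model using the Pearson coefficient and 10 nearest neighbours) and make recommendations.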

# splitting the data into train and test sets
esf<- evaluationScheme(movie_matrix, method = "split", train = 0.9, given = 5, goodRating = 3)
train <-getData(esf, "train")
test <-getData(esf, "unknown")
test_known <- getData(esf, "known")
#  building user-based recommendation model
param_f<- list (method = "Pearson", nn=10)
final_model <- Recommender(train, method = "UBCF", param = param_f)
final_model

## Recommender of type 'UBCF' for 'realRatingMatrix' 
## learned using 848 users.

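# getting recommendations (top 10); note that `test` holds the withheld ("unknown") ratings,
# so a strict hold-out evaluation would predict from `test_known` instead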
final_prediction<- predict (final_model, test, n = 10, type = "topNList")
final_prediction@items[1]

## $`14`
##  [1] 285 338 303 299 676 270 305 309 342 890

final_prediction@ratings[1]

## $`14`
##  [1] 4.844350 4.683689 4.427757 4.384879 4.377291 4.359480 4.345418
##  [8] 4.320462 4.320462 4.320462

Project_2

Olga Shiligin

13/06/2019
