# loading libraries
library(tidyverse)
library(tidyr)
library(readr)
library(data.table)
library(dplyr)
library(recommenderlab)
library(knitr)
library(kableExtra)
library(ggplot2)
In this project we use collaborative filtering, in which users' preferences for previously rated movies (or related items) drive the recommendations, either item-based or user-based. We will check the accuracy of the candidate models and use the best one to build a Shiny app that provides recommendations based on a user's preferences. We decided to continue working with the MovieLens dataset, which contains both movies and ratings; once the data is trained and a model is selected, the results will be deployed in a Shiny app to demonstrate the applicability of the model.
The recommenderlab package also ships with a built-in MovieLense dataset (with movie metadata in MovieLenseMeta), collected through the MovieLens web site (movielens.umn.edu) during the seven-month period from September 19th, 1997 through April 22nd, 1998 and containing about 100,000 ratings from 943 users. Here, however, we use the data downloaded from "https://grouplens.org/datasets/movielens/latest/", which offers two versions; we took the smaller one, since the full version has about 58,000 movies reviewed by 280,000 users and was taking too long to work with. The small release contains about 100,836 ratings (0.5-5) from 610 users, and although it includes several files, we will only work with "movies.csv and ratings.csv".
Let’s get the data now
ratings <- read.csv("C:/Users/hukha/Desktop/MS - Data Science/Data 612 - Recommender Systems/Data 612 - Recommender Systems/ml-latest-small/ratings.csv", stringsAsFactors = FALSE) %>% select(-c(timestamp))
movies <- read.csv("C:/Users/hukha/Desktop/MS - Data Science/Data 612 - Recommender Systems/Data 612 - Recommender Systems/ml-latest-small/movies.csv")
There are 100,836 ratings in the ratings dataset, covering 9,724 distinct movies rated by 610 users. The movies dataset lists each movie along with its id, which will be useful for joining with the ratings later on. It also contains each movie's genres, which will help users select their preferences and let us categorize movies accordingly. Later on, we will split these genres into separate columns.
# Ratings
head(ratings)
## userId movieId rating
## 1 1 1 4
## 2 1 3 4
## 3 1 6 4
## 4 1 47 5
## 5 1 50 5
## 6 1 70 3
summary(ratings$rating)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.500 3.000 3.500 3.502 4.000 5.000
# Movies
head(movies)
## movieId title
## 1 1 Toy Story (1995)
## 2 2 Jumanji (1995)
## 3 3 Grumpier Old Men (1995)
## 4 4 Waiting to Exhale (1995)
## 5 5 Father of the Bride Part II (1995)
## 6 6 Heat (1995)
## genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2 Adventure|Children|Fantasy
## 3 Comedy|Romance
## 4 Comedy|Drama|Romance
## 5 Comedy
## 6 Action|Crime|Thriller
As noted above, the movies data lists each movie with its genres. The problem is that all genres are stored together in a single column; since genres will be very useful for capturing user preferences, we have to split that column apart.
#head(movies)
search_movies <- movies %>% separate(col= genres, into= c(as.character(1:10)), sep="[|]")
## Warning: Expected 10 pieces. Missing pieces filled with `NA` in 9741
## rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
## 20, ...].
head(search_movies) %>% kable(caption="List of Movies with Genres") %>% kable_styling()
movieId | title | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | Toy Story (1995) | Adventure | Animation | Children | Comedy | Fantasy | NA | NA | NA | NA | NA |
2 | Jumanji (1995) | Adventure | Children | Fantasy | NA | NA | NA | NA | NA | NA | NA |
3 | Grumpier Old Men (1995) | Comedy | Romance | NA | NA | NA | NA | NA | NA | NA | NA |
4 | Waiting to Exhale (1995) | Comedy | Drama | Romance | NA | NA | NA | NA | NA | NA | NA |
5 | Father of the Bride Part II (1995) | Comedy | NA | NA | NA | NA | NA | NA | NA | NA | NA |
6 | Heat (1995) | Action | Crime | Thriller | NA | NA | NA | NA | NA | NA | NA |
rm <- dcast(ratings, userId~movieId, value.var="rating", na.rm=FALSE)
## Warning in dcast(ratings, userId ~ movieId, value.var = "rating", na.rm =
## FALSE): The dcast generic in data.table has been passed a data.frame and
## will attempt to redirect to the reshape2::dcast; please note that reshape2
## is deprecated, and this redirection is now deprecated as well. Please
## do this redirection yourself like reshape2::dcast(ratings). In the next
## version, this warning will become an error.
rm <- as.matrix(rm[,-1])
rm <- as(rm, "realRatingMatrix")
rm
## 610 x 9724 rating matrix of class 'realRatingMatrix' with 100836 ratings.
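As a side note, recommenderlab can usually coerce a three-column user/item/rating data frame directly into a sparse realRatingMatrix, avoiding the dense dcast step (and its deprecation warning). A minimal sketch, assuming the coercion accepts the columns in user, item, rating order:
# Hedged alternative: build the sparse rating matrix straight from the long table
rm_direct <- as(ratings[, c("userId", "movieId", "rating")], "realRatingMatrix")
# rm_direct should describe the same 610 x 9724 matrix with 100836 ratings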
We are going to use built-in functions from “recommenderlab” to create recommendation systems. Let’s explore which functions exist in recommenderlab that can be useful later.
recommender_models <- recommenderRegistry$get_entries(dataType= "realRatingMatrix")
names(recommender_models)
## [1] "ALS_realRatingMatrix" "ALS_implicit_realRatingMatrix"
## [3] "IBCF_realRatingMatrix" "LIBMF_realRatingMatrix"
## [5] "POPULAR_realRatingMatrix" "RANDOM_realRatingMatrix"
## [7] "RERECOMMEND_realRatingMatrix" "SVD_realRatingMatrix"
## [9] "SVDF_realRatingMatrix" "UBCF_realRatingMatrix"
Above are the names of the algorithms available in recommenderlab. We will use the IBCF and UBCF models and compare their performance to see which one recommends movies better. Both are collaborative-filtering approaches based on similarity between users or between items, and the similarity can be measured with cosine, Pearson, or Jaccard measures.
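To make these similarity measures concrete, here is a small hand-rolled sketch of cosine and Pearson similarity between two toy rating vectors (made-up values; recommenderlab's similarity() does this on the real matrix below):
# Toy rating vectors for two users over the same five movies (illustrative only)
u1 <- c(5, 3, 4, 4, 2)
u2 <- c(4, 2, 5, 3, 3)
sum(u1 * u2) / (sqrt(sum(u1^2)) * sqrt(sum(u2^2)))  # cosine: dot product / product of norms
cor(u1, u2)                                         # Pearson: correlation of the two vectors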
Now we will compute the similarity of the first few users and movies with the cosine method and see what it looks like.
# Similarity among the users
similarity_users <- as.matrix(similarity(rm[1:5, ], method="cosine", which="users"))
image(similarity_users, main="User similarity")
# Similarity among the movies
similarity_movies <- as.matrix(similarity(rm[, 1:5], method="cosine", which="items"))
image(similarity_movies, main="Movies similarity")
Let's visualize the overall ratings that users gave to the movies. Movies that were not rated are excluded from the graph.
rating_values <- as.vector(rm@data)
rating_values <- factor(rating_values[rating_values != 0])
# Now let's visualize it
plot(rating_values, main= " Ratings Distribution")
We can see that most ratings fall between 3 and 4, a fair number are around 5, and relatively few are below 2.
view_per_movie <- colCounts(rm) # Total views for each movie
count_views <- data.frame(movie = names(view_per_movie), views= view_per_movie)
count_views <- count_views[order(count_views$views, decreasing=TRUE), ]
count_views$title <- NA
# Adding movie titles
for (i in 1:9724){
count_views[i,3] <- as.character(subset(movies,
movies$movieId == count_views[i,1])$title)
}
head(count_views, 15) %>% kable() %>% kable_styling(full_width = FALSE)
 | movie | views | title |
---|---|---|---|
356 | 356 | 329 | Forrest Gump (1994) |
318 | 318 | 317 | Shawshank Redemption, The (1994) |
296 | 296 | 307 | Pulp Fiction (1994) |
593 | 593 | 279 | Silence of the Lambs, The (1991) |
2571 | 2571 | 278 | Matrix, The (1999) |
260 | 260 | 251 | Star Wars: Episode IV - A New Hope (1977) |
480 | 480 | 238 | Jurassic Park (1993) |
110 | 110 | 237 | Braveheart (1995) |
589 | 589 | 224 | Terminator 2: Judgment Day (1991) |
527 | 527 | 220 | Schindler’s List (1993) |
2959 | 2959 | 218 | Fight Club (1999) |
1 | 1 | 215 | Toy Story (1995) |
1196 | 1196 | 211 | Star Wars: Episode V - The Empire Strikes Back (1980) |
50 | 50 | 204 | Usual Suspects, The (1995) |
2858 | 2858 | 204 | American Beauty (1999) |
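The title lookup above fills the table with an explicit loop over all the movies; a join does the same job in one vectorized step. A minimal sketch of a hypothetical alternative (equivalent result assumed, not re-run here):
# Hedged alternative to the loop: join titles onto the view counts
count_views_joined <- data.frame(movieId = as.integer(names(view_per_movie)),
                                 views = view_per_movie) %>%
  left_join(select(movies, movieId, title), by = "movieId") %>%
  arrange(desc(views))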
Now let's visualize these most-watched movies.
ggplot(count_views[1:15,], aes(reorder(x=title, views),y= views))+geom_bar(stat="identity")+theme(axis.text.x=element_text(angle=45, hjust=1))+labs(title="Top Watching Movies", x="Movies", y="Views")
Forrest Gump and The Shawshank Redemption are the most-watched movies in this dataset (last updated in 2018).
There are a lot of missing values in the data, representing movies that a given user did not rate; generally speaking, it is practically impossible for every user to rate every movie in the database. We therefore restrict the data to the most relevant part: users who have rated more than 50 movies and movies with more than 50 ratings.
movie_ratings <- rm[rowCounts(rm) > 50, colCounts(rm) > 50]
movie_ratings
## 378 x 436 rating matrix of class 'realRatingMatrix' with 36214 ratings.
Now let's visualize the distribution of each user's average rating in the relevant data.
qplot(rowMeans(movie_ratings), binwidth=0.1, main="Distribution of Ratings")
Data normalization is an important step before building a recommender because it removes the bias of users who systematically rate high or low, giving a better and more accurate model. We will use the "normalize" function from the recommenderlab library.
movie_ratings_norm <- normalize(movie_ratings)
sum(rowMeans(movie_ratings_norm) > 0.0001)
## [1] 0
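The default "center" normalization simply subtracts each user's mean rating from that user's ratings, which is why the row means are now essentially zero. A minimal sketch of the same idea done by hand for one user (only the non-missing entries are touched):
# Manual centering of one user's ratings (unrated movies stay NA)
user1_raw <- as(movie_ratings[1, ], "matrix")
user1_centered <- user1_raw - mean(user1_raw, na.rm = TRUE)
round(mean(user1_centered, na.rm = TRUE), 10)  # ~0 after centering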
Now let's create a heatmap to see how the normalize function from recommenderlab removed this bias.
image(movie_ratings_norm[rowCounts(movie_ratings) > quantile(rowCounts(movie_ratings),0.95),
colCounts(movie_ratings) > quantile(colCounts(movie_ratings), 0.95)], main="Normalized Heatmap for top data")
As we can see, most of the values now sit close to zero. The few points far from zero appear because this heatmap only shows the most active users and most-watched movies rather than the entire data. With the data normalized, we can go ahead and build recommender systems on it.
The data is good to go now, but one last transformation can be helpful: converting the data to binary. In the first version, 1 marks a movie the user has rated and 0 marks a missing rating; in the second, 1 marks a rating above a threshold and 0 marks a missing or bad rating. Let's create a heatmap for each and see what they look like.
movie_ratings_watched <- binarize(movie_ratings, minRating = 1)
boolean_min_movies <- quantile(rowCounts(movie_ratings), 0.95)
boolean_min_users <- quantile(colCounts(movie_ratings), 0.95)
image(movie_ratings_watched[rowCounts(movie_ratings) > boolean_min_movies,
colCounts(movie_ratings) > boolean_min_users],
main = "Heatmap - Top Users & Movies")
movie_ratings_good <- binarize(movie_ratings, minRating = 3)
image(movie_ratings_good[rowCounts(movie_ratings) > boolean_min_movies,
colCounts(movie_ratings) > boolean_min_users],
main = "Heatmap - Top Users & Movies")
The second heatmap is sparser than the first because it treats both unwatched movies and poorly rated ones as zeros, showing that there are more movies with missing or bad ratings than movies that simply were not viewed.
This type of collaborative filtering is based on the similarity between items as rated by the same users. For each item, the k most similar items are identified, and for each user the algorithm recommends the items most similar to that user's highly rated ones.
Before building a model, let's split the data, with 80% in the training set and 20% in the test set.
# Criteria
which_train <- sample(x= c(TRUE, FALSE), size= nrow(movie_ratings), replace= TRUE, prob= c(0.8, 0.2))
# Creating train and test
movie_train <- movie_ratings[which_train, ]
movie_test <- movie_ratings[!which_train, ]
Now that the train and test sets have been created by splitting movie_ratings, we can go ahead and build the IBCF model.
For the sake of learning, let's first look at the parameters the IBCF algorithm takes before discussing it further.
recommender_models <- recommenderRegistry$get_entries(dataType ="realRatingMatrix")
recommender_models$IBCF_realRatingMatrix$parameters
## $k
## [1] 30
##
## $method
## [1] "Cosine"
##
## $normalize
## [1] "center"
##
## $normalize_sim_matrix
## [1] FALSE
##
## $alpha
## [1] 0.5
##
## $na_as_zero
## [1] FALSE
By default, k is 30 and similarity is computed with the cosine method. Now let's train a model.
# Building a model
movie_model <- Recommender(data= movie_train, method="IBCF", parameter= list(k=30))
movie_model
## Recommender of type 'IBCF' for 'realRatingMatrix'
## learned using 309 users.
The model has been built on the training set.
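Before predicting, it helps to peek at what the model stores: an item-to-item similarity matrix in which each item keeps only its most similar neighbours. A small sketch, assuming the fitted IBCF model exposes this matrix as sim via getModel:
# Inspect the learned item-item similarity matrix
model_details <- getModel(movie_model)
dim(model_details$sim)              # items x items for the filtered movies
sum(model_details$sim[1, ] > 0)     # roughly k similarities kept for the first item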
# Prediction
model_predictions1 <- predict(object= movie_model, newdata= movie_test, n=10)
model_predictions1
## Recommendations as 'topNList' with n = 10 for 69 users.
# Let's check recommended movies for user1
user1_recommendation <- model_predictions1@items[[1]]
movies_user1 <- model_predictions1@itemLabels[user1_recommendation]
#movies_user1
# Convert the first 8 recommended ids to titles; the last two entries in the
# output below remain raw movieIds because the loop stops at 8
for (i in 1:8){
movies_user1[i] <- as.character(subset(movies, movies$movieId == movies_user1[i])$title)
}
movies_user1
## [1] "First Knight (1995)"
## [2] "Judge Dredd (1995)"
## [3] "Lion King, The (1994)"
## [4] "City Slickers II: The Legend of Curly's Gold (1994)"
## [5] "Cliffhanger (1993)"
## [6] "In the Line of Fire (1993)"
## [7] "Aladdin (1992)"
## [8] "Beauty and the Beast (1991)"
## [9] "597"
## [10] "708"
Above movies are the recommended movies for User 1.
users <- paste0("User ", seq(1:8))
matrix_recommendation <- sapply(model_predictions1@items,
function(x){ as.integer(colnames(movie_ratings)[x]) })
m <- matrix_recommendation[,1:8]
colnames(m) <- users
m
## User 1 User 2 User 3 User 4 User 5 User 6 User 7 User 8
## [1,] 168 2080 161 3 349 3033 70 3147
## [2,] 173 903 1220 7 357 480 110 587
## [3,] 364 49530 4027 36 919 485 111 54503
## [4,] 432 52973 6942 231 1097 858 145 63082
## [5,] 434 1036 68954 293 2078 1196 160 3408
## [6,] 474 1923 53125 329 48385 1200 185 317
## [7,] 588 21 595 356 122904 1247 223 2011
## [8,] 595 300 32587 434 1645 2502 231 2355
## [9,] 597 357 5989 435 2115 2571 500 3489
## [10,] 708 508 1625 509 2617 2617 555 30707
The matrix above shows the 10 recommended movies (by movieId) for the first 8 users in the test set.
num_items <- factor(table(matrix_recommendation))
num_items_sorted <- sort(num_items, decreasing = TRUE)
num_items_top <- head(num_items_sorted, n = 10)
table_top <- data.frame(as.integer(names(num_items_top)),
num_items_top)
for (i in 1:10){
table_top[i,1] <- as.character(subset(movies,
movies$movieId == table_top[i,1])$title)
}
colnames(table_top) <- c("Movie Title", "# Items")
head(table_top)
## Movie Title # Items
## 3 Grumpier Old Men (1995) 13
## 70 From Dusk Till Dawn (1996) 8
## 172 Johnny Mnemonic (1995) 8
## 16 Casino (1995) 7
## 292 Outbreak (1995) 7
## 587 Ghost (1990) 7
Most of the movies have been recommended only a few times and a few movies have been recommended many times. IBCF recommends items on the basis of the similarity matrix. It’s an eager-learning model, that is, once it is built, it doesn’t need to access the initial data. For each item, the model stores the k-most similar, so the amount of information is small once the model is built. This is an advantage in the presence of lots of data. In addition, this algorithm is efficient and scalable, so it works well with big rating matrices. Its accuracy is rather good, compared with other recommendation models.
This type of collaborative filtering focuses on similarity between users rather than between items. Similarity is calculated among users, under the assumption that similar users will like the same movies. Similar users are identified first, and then the top-rated items among those similar users are recommended.
For each new user, these are the steps:
1- Measure how similar each user is to the new one. As with IBCF, popular similarity measures are correlation and cosine.
2- Identify the most similar users, either by taking the top k users (k-nearest neighbours) or by taking all users whose similarity is above a defined threshold.
3- Rate the items purchased by the most similar users. The rating is the average rating among the similar users, computed as either a plain average or a weighted average.
4- Pick the top-rated items.
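A minimal hand-rolled sketch of steps 1-3 on a small slice of the data (illustrative only; recommenderlab's UBCF model below does the real work, including normalization and weighting):
# Toy UBCF: find users most similar to user 1 and average their ratings
sim_small <- as.matrix(similarity(movie_ratings[1:50, ], method = "pearson", which = "users"))
nearest <- setdiff(order(sim_small[1, ], decreasing = TRUE), 1)[1:5]  # 5 nearest neighbours of user 1
neighbour_ratings <- as(movie_ratings[1:50, ][nearest, ], "matrix")
candidate_scores <- colMeans(neighbour_ratings, na.rm = TRUE)         # average-rating step
head(sort(candidate_scores, decreasing = TRUE))                       # top candidates before filtering seen movies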
Let’s go ahead and create a model.
recommender_models2 <- Recommender(movie_train, method="UBCF")
recommender_models2
## Recommender of type 'UBCF' for 'realRatingMatrix'
## learned using 309 users.
The function above went through the training set and learned from 309 users before making predictions.
# Details of model
results_model <- getModel(recommender_models2)
results_model
## $description
## [1] "UBCF-Real data: contains full or sample of data set"
##
## $data
## 309 x 436 rating matrix of class 'realRatingMatrix' with 30252 ratings.
## Normalized using center on rows.
##
## $method
## [1] "cosine"
##
## $nn
## [1] 25
##
## $sample
## [1] FALSE
##
## $normalize
## [1] "center"
##
## $verbose
## [1] FALSE
results_model$data
## 309 x 436 rating matrix of class 'realRatingMatrix' with 30252 ratings.
## Normalized using center on rows.
Just as we did with the IBCF model, we will now apply this model to the test data and see what the results look like.
model_predictions2 <- predict(object= recommender_models2, newdata= movie_test, n= 10)
model_predictions2
## Recommendations as 'topNList' with n = 10 for 69 users.
The function above produced the top 10 recommended movies for each of the 69 test users.
Let's check out the first 8 users and see what the recommendations look like.
users <- paste0("User ", seq(1:8))
matrix_recommendation2 <- sapply(model_predictions2@items, function(x){
colnames(movie_ratings)[x]
} )
m <- matrix_recommendation2[, 1:8]
colnames(m) <- users
m
## User 1 User 2 User 3 User 4 User 5 User 6 User 7 User 8
## [1,] "8961" "589" "1387" "318" "49272" "589" "110" "593"
## [2,] "1193" "2028" "1200" "58559" "541" "1198" "337" "223"
## [3,] "904" "223" "1035" "5952" "2571" "260" "47" "36"
## [4,] "4995" "49272" "919" "4993" "356" "1214" "457" "17"
## [5,] "60069" "3996" "2804" "68954" "4896" "908" "36" "337"
## [6,] "293" "457" "1097" "4973" "8961" "1291" "1617" "2700"
## [7,] "4226" "2355" "1704" "7153" "1198" "480" "16" "1197"
## [8,] "48780" "1584" "2791" "356" "5418" "750" "904" "457"
## [9,] "5618" "1485" "1203" "7361" "6711" "588" "597" "235"
## [10,] "91529" "588" "1028" "49272" "1270" "60069" "349" "594"
The matrix above shows the top 10 recommended movies for the first 8 users; the values are movie ids.
# Note: num_items here carries over from the IBCF section, so this table repeats
# the IBCF counts; UBCF-specific counts would require rebuilding it from matrix_recommendation2
num_items_sorted <- sort(num_items, decreasing = TRUE)
num_items_top <- head(num_items_sorted, n = 4)
table_top <- data.frame(as.integer(names(num_items_top)), num_items_top)
for (i in 1:4){
table_top[i,1] <- as.character(subset(movies,
movies$movieId == table_top[i,1])$title)
}
colnames(table_top) <- c("Movie Titles", "# Items")
head(table_top)
## Movie Titles # Items
## 3 Grumpier Old Men (1995) 13
## 70 From Dusk Till Dawn (1996) 8
## 172 Johnny Mnemonic (1995) 8
## 16 Casino (1995) 7
Comparing the results of UBCF with IBCF helps in understanding the algorithms better. UBCF needs to access the initial data, so it is a lazy-learning model. Since it needs to keep the entire database in memory, it doesn't work well with a very large rating matrix, and building the similarity matrix requires a lot of computing power and time. However, UBCF has been shown to be slightly more accurate than IBCF, so it's a good option if the dataset is not too big.
In order to evaluate the recommender systems we have to go through the following steps:
1- Prepare the data to evaluate performance
2- Evaluate the performance of some models
3- Choose the best performing models
4- Optimize model parameters
To evaluate models, we need to build them with some data and test them on other data, i.e., a train/test split (data splitting). We can also use bootstrapping or the k-fold approach.
We are going to split the data into train and test datasets with proportion of 80% and 20% respectively. For each user in the test set, we need to define how many items to use to generate recommendations. The remaining items will be used to test the model accuracy. It is better that this parameter is lower than the minimum number of items purchased by any user so that we don’t have users without items to test the models:
train_percent <- 0.80
min(rowCounts(movie_ratings))
## [1] 11
The minimum number of movies rated by any user is 11, so we keep the number of items given to the recommender below 11 (here, 8).
# Parameters
keep <- 8
rating_threshold <- 3
n_eval <- 1
# Let's split the data through evaluationScheme function from recommenderlab
eval_sets <- evaluationScheme(data= movie_ratings,
method="split",
train= train_percent,
given= keep,
goodRating = rating_threshold, k=n_eval)
eval_sets
## Evaluation scheme with 8 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.800
## Good ratings: >=3.000000
## Data set: 378 x 436 rating matrix of class 'realRatingMatrix' with 36214 ratings.
To extract the sets we use the getData function from recommenderlab. Three sets will be extracted: train, known and unknown. train is the training set, known is the part of the test set containing the items used to build the recommendations, and unknown is the part of the test set containing the items used to test the recommendations.
getData(eval_sets, "train")
## 302 x 436 rating matrix of class 'realRatingMatrix' with 28057 ratings.
getData(eval_sets, "known")
## 76 x 436 rating matrix of class 'realRatingMatrix' with 608 ratings.
getData(eval_sets, "unknown")
## 76 x 436 rating matrix of class 'realRatingMatrix' with 7549 ratings.
As we can see, movie_ratings has been split into three datasets, each of class realRatingMatrix.
In this approach, rather than splitting the data into a training and a test part, we sample the rows with replacement. The same user can be sampled more than once and, if the training set is the same size as before, there will be more users in the test set. This is called bootstrapping, and recommenderlab supports it directly.
# Bootstrapping approach
eval_sets2 <- evaluationScheme(data= movie_ratings,
method="bootstrap",
train= train_percent,
given= keep,
goodRating= rating_threshold,
k=n_eval)
table_train <- table(eval_sets2@runsTrain[[1]])
n_repetitions <- factor(as.vector(table_train))
qplot(n_repetitions) +
ggtitle("Repetitions in the Training Set")
All of the users have been sampled fewer than four times.
The two previous approaches tested the recommender on only part of the users. If instead we test the recommendations on every user, we can measure performance more accurately. We split the data into chunks, hold one chunk out as the test set, evaluate the accuracy, and then repeat with the other chunks. This is known as the k-fold approach.
n_total <- 4
eval_sets3 <- evaluationScheme(data=movie_ratings,
method="cross-validation",
k= n_total,
given= keep,
goodRating= rating_threshold)
size_sets <- sapply(eval_sets3@runsTrain, length)
size_sets
## [1] 282 282 282 282
Using the k-fold approach, we get four training sets of 282 users each.
In order to recommend items to new users, collaborative filtering estimates the ratings of items that are not yet purchased. Then, it recommends the top-rated items. At the moment, let’s forget about the last step. We can evaluate the model by comparing the estimated ratings with the real ones.
Let's prepare the data for validation using the k-fold approach, as it seems the most accurate approach for evaluation.
evaluate_accuracy <- evaluationScheme(data= movie_ratings,
method="cross-validation",
k= n_total,
given= keep,
goodRating= rating_threshold)
evaluate_accuracy
## Evaluation scheme with 8 items given
## Method: 'cross-validation' with 4 run(s).
## Good ratings: >=3.000000
## Data set: 378 x 436 rating matrix of class 'realRatingMatrix' with 36214 ratings.
Now we have to define the model that needs to be evaluated.
evaluation_for_model <- "IBCF"
model_parameters <- NULL
# model for evaluation
evaluate_model <- Recommender(data= getData(evaluate_accuracy, "train"),
method= evaluation_for_model, parameter=model_parameters)
evaluate_model
## Recommender of type 'IBCF' for 'realRatingMatrix'
## learned using 282 users.
items_to_recommend <- 10
evaluate_prediction <- predict(object= evaluate_model, newdata= getData(evaluate_accuracy, "known"),
n= items_to_recommend,
type="ratings")
evaluate_prediction
## 96 x 436 rating matrix of class 'realRatingMatrix' with 15490 ratings.
qplot(rowCounts(evaluate_prediction)) +
geom_histogram(binwidth = 10) +
ggtitle("Distribution of Movies, per user")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The number of predicted movies per user is roughly between 100 and 220.
Next we measure the accuracy metrics, i.e., RMSE, MSE and MAE, which can be computed with calcPredictionAccuracy from recommenderlab. Let's calculate the accuracy by user first.
evaluate_accuracy2 <- calcPredictionAccuracy(x= evaluate_prediction,
data = getData(evaluate_accuracy, "unknown"),
byUser=TRUE)
head(evaluate_accuracy2, 10)
## RMSE MSE MAE
## [1,] 1.0741723 1.1538462 0.8461538
## [2,] 1.3660964 1.8662194 0.9317085
## [3,] 0.9149431 0.8371209 0.6887890
## [4,] 0.9326441 0.8698250 0.7492508
## [5,] 0.5679543 0.3225721 0.4626137
## [6,] 2.1634577 4.6805491 1.7139765
## [7,] 1.2817399 1.6428571 1.0000000
## [8,] 1.3416408 1.8000000 1.0000000
## [9,] 1.1180044 1.2499338 0.8951357
## [10,] 0.9443153 0.8917314 0.6846772
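For reference, these three metrics are simple functions of the per-user prediction errors. A toy sketch with made-up predicted and true ratings (not the actual data):
# Toy example of the three error metrics
predicted <- c(3.5, 4.0, 2.5, 5.0)
actual <- c(4.0, 3.5, 3.0, 4.5)
errors <- predicted - actual
c(RMSE = sqrt(mean(errors^2)), MSE = mean(errors^2), MAE = mean(abs(errors)))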
Now let’s visualize the RMSE for each user.
qplot(evaluate_accuracy2[, "RMSE"]) +
geom_histogram(binwidth = 0.1) +
ggtitle("Distribution of RMSE, by user")
Most per-user RMSE values fall between roughly 0.5 and 2. Now let's compute a performance index for the whole model by setting byUser = FALSE.
evaluate_accuracy3 <- calcPredictionAccuracy(x= evaluate_prediction,
data= getData(evaluate_accuracy, "unknown"),
byUser= FALSE)
evaluate_accuracy3
## RMSE MSE MAE
## 1.2668545 1.6049204 0.9511248
RMSE for the entire model is 1.2668545.
Another way to measure accuracy is to compare the recommendations with the movies that have a positive rating. Let's see what that looks like.
results <- evaluate(x= evaluate_accuracy,
method= evaluation_for_model,
n = seq(10,100,10))
## IBCF run fold/sample [model time/prediction time]
## 1 [0.3sec/0.03sec]
## 2 [0.29sec/0.02sec]
## 3 [0.32sec/0.01sec]
## 4 [0.3sec/0.01sec]
results
## Evaluation results for 4 folds/samples using method 'IBCF'.
Using getConfusionMatrix, we can extract a list of confusion matrices; each element of the list corresponds to a different split of the k-fold. Let's take a look at the first element.
head(getConfusionMatrix(results))[[1]]
## TP FP FN TN precision recall TPR
## 10 1.510417 8.489583 73.67708 344.3229 0.1510417 0.02563164 0.02563164
## 20 3.281250 16.718750 71.90625 336.0938 0.1640625 0.05133733 0.05133733
## 30 5.250000 24.750000 69.93750 328.0625 0.1750000 0.07835795 0.07835795
## 40 7.020833 32.979167 68.16667 319.8333 0.1755208 0.10393649 0.10393649
## 50 8.510417 41.458333 66.67708 311.3542 0.1703014 0.12135278 0.12135278
## 60 10.135417 49.729167 65.05208 303.0833 0.1692598 0.14481997 0.14481997
## 70 11.614583 58.031250 63.57292 294.7812 0.1668231 0.16417442 0.16417442
## 80 13.270833 66.031250 61.91667 286.7812 0.1673626 0.18504456 0.18504456
## 90 14.645833 74.125000 60.54167 278.6875 0.1651420 0.20000847 0.20000847
## 100 16.093750 81.916667 59.09375 270.8958 0.1646384 0.21597834 0.21597834
## FPR
## 10 0.02442051
## 20 0.04785770
## 30 0.07046754
## 40 0.09383172
## 50 0.11789477
## 60 0.14147531
## 70 0.16505899
## 80 0.18772660
## 90 0.21074200
## 100 0.23295844
We sum the TP, FP, FN and TN columns across the folds to view them in table form; the remaining columns are easier to interpret visually.
sum_col <- c("TP", "FP", "FN", "TN")
indices_summed <- Reduce("+", getConfusionMatrix(results))[, sum_col]
head(indices_summed)
## TP FP FN TN
## 10 6.833333 33.06250 307.6979 1364.406
## 20 14.114583 65.67708 300.4167 1331.792
## 30 21.291667 98.37500 293.2396 1299.094
## 40 28.427083 130.69792 286.1042 1266.771
## 50 34.885417 163.25000 279.6458 1234.219
## 60 41.770833 195.05208 272.7604 1202.417
Now let's visualize the ROC curve.
plot(results, annotate=TRUE, main ="ROC Curve")
Two useful accuracy metrics are precision and recall. Precision is the percentage of recommended items that have been purchased: the number of True Positives (TP) divided by the total number of recommended items (TP + FP). Recall is the percentage of purchased items that have been recommended: the number of True Positives (TP) divided by the total number of purchases (TP + FN).
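We can sanity-check the precision definition against the n = 10 row of the confusion matrix shown earlier: TP / (TP + FP) = 1.51 / (1.51 + 8.49) is about 0.151, matching the precision column. (Recall is averaged per user, so the same ratio computed from the averaged counts only matches approximately.) A minimal sketch:
# Recompute precision from the averaged n = 10 counts of the first fold
cm <- getConfusionMatrix(results)[[1]]
tp <- cm["10", "TP"]; fp <- cm["10", "FP"]
tp / (tp + fp)  # ~0.151, matching the precision column above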
There is a tradeoff between the two: recommending more items tends to increase recall (more of the purchased items get recommended) while decreasing precision. Let's visualize the precision-recall plot to see the results.
plot(results, "prec/rec", annotate=TRUE, main = "Precision / Recall")
The plot shows the tradeoff between precision and recall. Even if the curve is not perfectly monotonic, the trends are as expected.
In order to compare different models, we define the following list of models as a baseline.
baseline <- list(
IBCF_cosine = list(name = "IBCF",
param = list(method = "cosine")),
IBCF_pearson = list(name = "IBCF",
param = list(method = "pearson")),
UBCF_cosine = list(name = "UBCF",
param = list(method = "cosine")),
UBCF_pearson = list(name = "UBCF",
param = list(method = "pearson")),
Random = list(name = "RANDOM", param=NULL)
)
baseline
## $IBCF_cosine
## $IBCF_cosine$name
## [1] "IBCF"
##
## $IBCF_cosine$param
## $IBCF_cosine$param$method
## [1] "cosine"
##
##
##
## $IBCF_pearson
## $IBCF_pearson$name
## [1] "IBCF"
##
## $IBCF_pearson$param
## $IBCF_pearson$param$method
## [1] "pearson"
##
##
##
## $UBCF_cosine
## $UBCF_cosine$name
## [1] "UBCF"
##
## $UBCF_cosine$param
## $UBCF_cosine$param$method
## [1] "cosine"
##
##
##
## $UBCF_pearson
## $UBCF_pearson$name
## [1] "UBCF"
##
## $UBCF_pearson$param
## $UBCF_pearson$param$method
## [1] "pearson"
##
##
##
## $Random
## $Random$name
## [1] "RANDOM"
##
## $Random$param
## NULL
In order to evaluate the models properly, we need to test them, varying the number of items. For instance, we might want to recommend up to 100 movies to each user. Since 100 is already a big number of recommendations, we don’t need to include higher values.
n_recommendations <- c(1, 5, seq(10, 100, 10))
results_list <- evaluate(x = eval_sets,
method = baseline,
n = n_recommendations)
## IBCF run fold/sample [model time/prediction time]
## 1 [0.33sec/0.02sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [0.33sec/0.02sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/0.06sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/0.06sec]
## RANDOM run fold/sample [model time/prediction time]
## 1 [0sec/0.03sec]
sapply(results_list, class) == "evaluationResults"
## IBCF_cosine IBCF_pearson UBCF_cosine UBCF_pearson Random
## TRUE TRUE TRUE TRUE TRUE
avg_matrices <- lapply(results_list, avg)
head(avg_matrices$IBCF_cosine[, 5:8])
## precision recall TPR FPR
## 1 0.1842105 0.002068477 0.002068477 0.00240568
## 5 0.1921053 0.011795969 0.011795969 0.01195598
## 10 0.1986842 0.023051880 0.023051880 0.02350827
## 20 0.2092105 0.048745419 0.048745419 0.04607351
## 30 0.2057018 0.073326526 0.073326526 0.07009694
## 40 0.1927632 0.089933032 0.089933032 0.09524585
It's easier to see the results visually and identify the best model. We can compare the models by plotting their ROC curves.
plot(results_list, annotate=1, legend="topleft", main="ROC Curve")
A good performance index is the area under the ROC curve (AUC). The UBCF_pearson curve is the highest, so it is the best-performing technique here.
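A rough way to turn that area into a number is a trapezoidal approximation over the evaluated values of n (it only covers the FPR range we actually measured, so treat it as a relative comparison rather than a true AUC):
# Approximate area under each ROC curve with the trapezoidal rule
auc_approx <- function(m) {
  roc <- m[order(m[, "FPR"]), c("FPR", "TPR")]
  sum(diff(roc[, "FPR"]) * (head(roc[, "TPR"], -1) + tail(roc[, "TPR"], -1)) / 2)
}
sort(sapply(avg_matrices, auc_approx), decreasing = TRUE)
Let's also plot the precision-recall chart to check the insights.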
plot(results_list, "prec/rec", annotate = 1, legend = "bottomright", main="Precision - Recall")
We can see that UBCF_pearson is still on top and hence the best-performing of all the models.
Recommendation models often have numeric parameters. For instance, IBCF considers the k closest items; how can we optimize k? Just as we compared the other options, we can test different values of a numeric parameter and check which works best. So far k was left at its default value of 30; now let's explore more values, ranging between 5 and 40. We will use UBCF with Pearson similarity, since that was the best-performing model so far.
k <- c(5, 10, 20, 30, 40)
# Note: k is an IBCF parameter; UBCF's neighbourhood size is controlled by nn,
# so passing k here triggers the "Unknown parameters: k" warnings below and
# every model falls back to the default nn = 25
baseline <- lapply(k, function(k){
list(name = "UBCF",
param = list(method = "pearson", k = k))
})
names(baseline) <- paste0("UBCF_k_", k)
names(baseline)
## [1] "UBCF_k_5" "UBCF_k_10" "UBCF_k_20" "UBCF_k_30" "UBCF_k_40"
n_recommendations <- c(1,5, seq(10,100,10))
results_list <- evaluate(x= evaluate_accuracy, method= baseline, n= n_recommendations)
## UBCF run fold/sample [model time/prediction time]
## 1
## Warning: Unknown parameters: k
## Available parameter (with default values):
## method = cosine
## nn = 25
## sample = FALSE
## normalize = center
## verbose = FALSE
## [0sec/0.09sec]
## 2
## Warning: Unknown parameters: k
## Available parameter (with default values):
## method = cosine
## nn = 25
## sample = FALSE
## normalize = center
## verbose = FALSE
## [0sec/0.08sec]
## 3
## Warning: Unknown parameters: k
## Available parameter (with default values):
## method = cosine
## nn = 25
## sample = FALSE
## normalize = center
## verbose = FALSE
## [0sec/0.08sec]
## 4
## Warning: Unknown parameters: k
## Available parameter (with default values):
## method = cosine
## nn = 25
## sample = FALSE
## normalize = center
## verbose = FALSE
## [0sec/0.11sec]
## UBCF run fold/sample [model time/prediction time]
## 1
## Warning: Unknown parameters: k
## Available parameter (with default values):
## method = cosine
## nn = 25
## sample = FALSE
## normalize = center
## verbose = FALSE
## [0.01sec/0.08sec]
## 2
## Warning: Unknown parameters: k
## Available parameter (with default values):
## method = cosine
## nn = 25
## sample = FALSE
## normalize = center
## verbose = FALSE
## [0sec/0.09sec]
## 3
## Warning: Unknown parameters: k
## Available parameter (with default values):
## method = cosine
## nn = 25
## sample = FALSE
## normalize = center
## verbose = FALSE
## [0sec/0.1sec]
## 4
## Warning: Unknown parameters: k
## Available parameter (with default values):
## method = cosine
## nn = 25
## sample = FALSE
## normalize = center
## verbose = FALSE
## [0sec/0.09sec]
## UBCF run fold/sample [model time/prediction time]
## 1
## Warning: Unknown parameters: k
## Available parameter (with default values):
## method = cosine
## nn = 25
## sample = FALSE
## normalize = center
## verbose = FALSE
## [0sec/0.08sec]
## 2
## Warning: Unknown parameters: k
## Available parameter (with default values):
## method = cosine
## nn = 25
## sample = FALSE
## normalize = center
## verbose = FALSE
## [0sec/0.09sec]
## 3
## Warning: Unknown parameters: k
## Available parameter (with default values):
## method = cosine
## nn = 25
## sample = FALSE
## normalize = center
## verbose = FALSE
## [0sec/0.1sec]
## 4
## Warning: Unknown parameters: k
## Available parameter (with default values):
## method = cosine
## nn = 25
## sample = FALSE
## normalize = center
## verbose = FALSE
## [0sec/0.1sec]
## UBCF run fold/sample [model time/prediction time]
## 1
## Warning: Unknown parameters: k
## Available parameter (with default values):
## method = cosine
## nn = 25
## sample = FALSE
## normalize = center
## verbose = FALSE
## [0sec/0.08sec]
## 2
## Warning: Unknown parameters: k
## Available parameter (with default values):
## method = cosine
## nn = 25
## sample = FALSE
## normalize = center
## verbose = FALSE
## [0sec/0.08sec]
## 3
## Warning: Unknown parameters: k
## Available parameter (with default values):
## method = cosine
## nn = 25
## sample = FALSE
## normalize = center
## verbose = FALSE
## [0sec/0.09sec]
## 4
## Warning: Unknown parameters: k
## Available parameter (with default values):
## method = cosine
## nn = 25
## sample = FALSE
## normalize = center
## verbose = FALSE
## [0sec/0.09sec]
## UBCF run fold/sample [model time/prediction time]
## 1
## Warning: Unknown parameters: k
## Available parameter (with default values):
## method = cosine
## nn = 25
## sample = FALSE
## normalize = center
## verbose = FALSE
## [0.02sec/0.08sec]
## 2
## Warning: Unknown parameters: k
## Available parameter (with default values):
## method = cosine
## nn = 25
## sample = FALSE
## normalize = center
## verbose = FALSE
## [0sec/0.08sec]
## 3
## Warning: Unknown parameters: k
## Available parameter (with default values):
## method = cosine
## nn = 25
## sample = FALSE
## normalize = center
## verbose = FALSE
## [0.02sec/0.08sec]
## 4
## Warning: Unknown parameters: k
## Available parameter (with default values):
## method = cosine
## nn = 25
## sample = FALSE
## normalize = center
## verbose = FALSE
## [0sec/0.1sec]
Let’s plot ROC curve to see which parameter is the best.
plot(results_list, annotate=1, legend="topleft", main= "ROC Curve")
plot(results_list, "prec/rec", annotate = 1, legend = "bottomright", main="Precision - Recall")
For UBCF_pearson, every tested value of k gives exactly the same result. This is expected: as the warnings above show, UBCF does not have a k parameter (its neighbourhood size is nn, left at its default of 25), so the five models are identical. We chose UBCF_pearson because it was the best-performing technique in the previous comparison.
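If the goal is to tune the UBCF neighbourhood size, the parameter to vary is nn (as the warnings indicate), not k. A hedged sketch of what that comparison would look like (not run here):
# Vary UBCF's actual neighbourhood-size parameter, nn, instead of k
nn_values <- c(5, 10, 20, 30, 40)
ubcf_models <- lapply(nn_values, function(nn){
list(name = "UBCF", param = list(method = "pearson", nn = nn))
})
names(ubcf_models) <- paste0("UBCF_nn_", nn_values)
# results_nn <- evaluate(x = evaluate_accuracy, method = ubcf_models, n = n_recommendations)
# plot(results_nn, annotate = 1, legend = "topleft", main = "ROC Curve by nn")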
The idea was to build a recommendation system that helps users find movies they might like. We started with collaborative filtering (both UBCF and IBCF) to check which works better, and the results showed that user-based collaborative filtering with Pearson similarity performed best. We also attempted to tune the neighbourhood parameter; since the tested values produced identical ROC curves, the default setting is a reasonable choice. Unfortunately, due to time constraints, we could not implement the model in a Shiny app for users to try the recommendations.
Building a Recommendation System with R (Suresh Gorakala and Michele Usuelli, 2015)