The goal of this assignment is to implement, configure, and evaluate recommender systems using different approaches. To achieve this, the assignment calls for taking a user-item ratings dataset (from MovieLens in this case) and implementing at least two recommendation algorithms.
The evaluation should then compare the different approaches: different algorithms, normalization techniques, similarity measures, neighborhood sizes, etc.
The MovieLens dataset was retrieved from the https://grouplens.org/datasets/movielens/ website, using the small version with 100,000 ratings and 1,300 tag applications for 9,000 movies by 700 users. This is the most recent release (10/2016) with the most manageable size for R; the real challenge was creating a rating matrix of a size that R could handle.
library(dplyr)
library(tidyr)
library(ggplot2)
library(recommenderlab)
library(knitr)
The MovieLens matrix was created by combining two datasets (movies and ratings). The movies dataset includes the movieId, title, and genres, while the ratings dataset includes the userId, movieId, rating, and a timestamp.
movies = read.csv("https://raw.githubusercontent.com/Galanopoulog/DATA643/master/Project%202/movies.csv",
header = TRUE, sep = ",", stringsAsFactors = FALSE)
ratings = read.csv("https://raw.githubusercontent.com/Galanopoulog/DATA643/master/Project%202/ratings.csv",
header = TRUE, sep =",", stringsAsFactors = FALSE)
kable(head(movies))
movieId | title | genres |
---|---|---|
1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
2 | Jumanji (1995) | Adventure|Children|Fantasy |
3 | Grumpier Old Men (1995) | Comedy|Romance |
4 | Waiting to Exhale (1995) | Comedy|Drama|Romance |
5 | Father of the Bride Part II (1995) | Comedy |
6 | Heat (1995) | Action|Crime|Thriller |
kable(head(ratings))
userId | movieId | rating | timestamp |
---|---|---|---|
1 | 31 | 2.5 | 1260759144 |
1 | 1029 | 3.0 | 1260759179 |
1 | 1061 | 3.0 | 1260759182 |
1 | 1129 | 2.0 | 1260759185 |
1 | 1172 | 4.0 | 1260759205 |
1 | 1263 | 2.0 | 1260759151 |
In order to create the matrix, both datasets first had to be downsized (so that R could handle them) and then combined. The most-reviewed movies and the most active users were preferable, so the data was downsized with that in mind: only users with at least 20 reviews and movies with at least 46 reviews were kept. Before being shaped into a matrix, the dataset was formatted as follows:
# Rank users by number of ratings (most active first) and keep up to the top 10,000
x = head(data.frame(table(ratings$userId) %>% sort(decreasing = T)), 10000)
colnames(x) = c("userId", "count")
ratex = merge(ratings, x, by = "userId")
# Attach movie titles and keep only the columns needed for the rating matrix
submovie = subset(movies, select = c("movieId", "title"))
subrating = subset(ratex, select = c("userId", "movieId", "rating"))
prematrix = subset(merge(subrating, submovie, by = "movieId"), select = c("userId", "rating", "title"))
# Keep the 500 most-reviewed movies
y = head(data.frame(table(prematrix$title) %>% sort(decreasing = T)), 500)
colnames(y) = c("title", "count2")
prematrix2 = subset(merge(prematrix, y, by = "title"), select = c("title", "userId", "rating"))
kable(head(prematrix2))
title | userId | rating |
---|---|---|
10 Things I Hate About You (1999) | 599 | 5.0 |
10 Things I Hate About You (1999) | 207 | 0.5 |
10 Things I Hate About You (1999) | 638 | 4.5 |
10 Things I Hate About You (1999) | 128 | 5.0 |
10 Things I Hate About You (1999) | 596 | 3.0 |
10 Things I Hate About You (1999) | 95 | 5.0 |
The matrix built from this dataset was originally intended to be somewhat larger, but due to formatting issues with the "realRatingMatrix" class it was reduced to roughly 500x500. More formatting issues appeared later in the assignment (in the evaluation section), and the matrix was reduced in size further.
# Pivot to wide format: one row per user, one column per movie title
matrix = spread(prematrix2, title, rating)
# Remove rows 501-670 and the userId column, bringing the matrix down to roughly 500 x 500
ex = as.matrix(matrix[-c(501:670), -1])
data = as(ex, "realRatingMatrix")
# Keep only users with more than 30 ratings and movies with more than 50 ratings
data = data[rowCounts(data) > 30, colCounts(data) > 50]
data
## 306 x 281 rating matrix of class 'realRatingMatrix' with 22003 ratings.
Despite the matrix having dimensions of 306 x 281, it still includes 22,003 ratings, which should be enough to create and evaluate the differences between recommender systems with varying algorithms.
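As a quick sanity check of how sparse the matrix is, its density can be computed directly from the number of stored ratings and the matrix dimensions (a small sketch using recommenderlab accessors):

# Density of the rating matrix: stored ratings divided by total cells
length(getRatings(data)) / prod(dim(data))
# 22003 / (306 * 281) is roughly 0.26, so about a quarter of the cells are filled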
The heatmap below shows roughly the top 30 users and movies (those with more than 150 ratings each) and illustrates how sparse the matrix is. Darker cells indicate higher ratings, so rows that are consistently dark correspond to users who rated highly across the movies they saw.
image(data[rowCounts(data) > 150, colCounts(data) > 150],
      main = "Heatmap of the Top Users and Movies")
When looking at the ratings themselves, without taking into consideration the users or the movies, the majority of the ratings are 3-stars and above, with 4 stars being the most common grade.
# Ratings
qplot(getRatings(data), geom="histogram", main = "Histogram of Ratings", xlab = "Rating Scores", binwidth = 0.5, fill=I("cornflower blue"))
It is interesting to note that some of the results from the earlier histogram (particularly the frequency of 5-star scores) change when the ratings are aggregated per movie. The average score per movie falls primarily within the 3.5-4 star range. The other scores occur far less often; in fact, even combined into one category they would not rival the 3.5-star count, and average scores in the 0-2 star range are nonexistent in this matrix.
# Ratings per Movie
qplot(colMeans(data), geom="histogram", main = "Histogram of Movie Ratings", xlab = "Rating Scores Per Movie", binwidth = 0.5, fill=I("FireBrick"))
Aggregating by user instead, the average rating ranges from as low as 2 stars to as high as 4.5. Even though, once again, the ratings fall primarily in the 3.5-4 star range, users are more willing to give out 4.5 stars than the "Rating Scores Per Movie" histogram would suggest.
# Ratings per User
qplot(rowMeans(data), geom="histogram", main = "Histogram of User Ratings", xlab = "Rating Scores Per User", binwidth = 0.5, fill=I("Plum"))
Though some users may give out low scores, it is unusual for one user to consistently give low ratings (as a tangent, this fact may help detect accounts created with the intention of sabotaging a recommender system). Overall, users typically rate movies around 3.5 or 4 stars, and it can be assumed that the users who rate movies very highly (4.5 and above) do so for movies that most other users rate lower, since very few movies average 4.5 overall.
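These observations can be checked numerically by summarizing the per-user average ratings (a small sketch; the 3 and 4.5 star thresholds are simply the ones discussed above):

# Distribution of per-user average ratings
summary(rowMeans(data))
# Users whose average rating falls below 3 stars
sum(rowMeans(data) < 3)
# Users whose average rating is 4.5 stars or higher
sum(rowMeans(data) >= 4.5)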
# Splitting training and testing
values = sample(x = c(TRUE, FALSE), size = nrow(data),
replace = TRUE, prob = c(0.8, 0.2))
train = data[values, ]
test = data[!values, ]
Item-to-Item Collaborative Filtering uses the ratings of many users: by examining how individuals rated particular items, it calculates the similarity between items and recommends items similar to those a user already rated highly. The more similar two items' rating patterns, the more likely one is to be recommended to users who liked the other.
Because this recommender system is Item-to-Item, the "IBCF" method (Item-Based Collaborative Filtering) was used, with the closeness between items measured using cosine similarity.
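To illustrate the distance measure the IBCF model is built on, the item-to-item cosine similarities can be computed directly with recommenderlab's similarity() helper. The sketch below does this for the first five movies only; the fitted model keeps the k = 30 most similar items per movie.

# Item-item cosine similarity for the first five movies (illustrative only)
item_sim = similarity(data[, 1:5], method = "Cosine", which = "items")
as.matrix(item_sim)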
item_item = Recommender(data = train, method = "IBCF",
parameter = list(k = 30, method = "Cosine"))
item_item
## Recommender of type 'IBCF' for 'realRatingMatrix'
## learned using 226 users.
A function was then created that, given a user's Id number, returns a list of six recommended movies. Taking the first user (User 1), the six recommendations are shown below:
# recommending
n_recommended = 6
predictions = predict(object = item_item, newdata = test,
n = n_recommended)
predictions
## Recommendations as 'topNList' with n = 6 for 80 users.
item_pred = function(idNum){
user_x = predictions@items[[idNum]]
user_x = predictions@itemLabels[user_x]
return(data.frame(user_x))
}
item_pred(1)
## user_x
## 1 Client, The (1994)
## 2 Crouching Tiger, Hidden Dragon (Wo hu cang long) (2000)
## 3 Finding Nemo (2003)
## 4 Game, The (1997)
## 5 In the Line of Fire (1993)
## 6 Piano, The (1993)
The ten movies recommended most often by this recommender system are listed below, in both the ordered table and the bar chart:
recc_matrix = sapply(predictions@items, function(x){
colnames(data)[x]
})
number_of_items = data.frame(table(unlist(recc_matrix, use.names=FALSE)))
sorted = number_of_items[order(number_of_items$Freq, decreasing = T),][1:10,]
kable(sorted)
 | Var1 | Freq |
---|---|---|
2 | Abyss, The (1989) | 7 |
5 | Addams Family Values (1993) | 7 |
13 | American Beauty (1999) | 7 |
67 | Dead Poets Society (1989) | 7 |
82 | Fifth Element, The (1997) | 7 |
14 | American History X (1998) | 6 |
16 | American President, The (1995) | 6 |
27 | Batman (1989) | 6 |
38 | Bourne Identity, The (2002) | 6 |
39 | Bourne Supremacy, The (2004) | 6 |
ggplot(data=sorted, aes(x=Var1, y=Freq)) +
geom_bar(stat="identity", fill="steelblue")+
geom_text(aes(label=Freq), color="red", size=3)+
theme_minimal()+
xlab("Movie Titles")+
coord_flip()
User-to-User Collaborative Filtering also uses the information of many users, but it examines the similarity between the users themselves and recommends items based on what similar users have rated. The closer two users' rating histories, the more likely an item rated by one user (but not yet by the other) is to be recommended.
Because this recommender system is User-to-User, the "UBCF" method (User-Based Collaborative Filtering) was used, with the closeness between users measured using cosine similarity. Note that the UBCF recommender controls its neighborhood size through the nn parameter rather than k, which is why the call below produces a warning and falls back to the default of 25 neighbors.
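Analogously to the item case, the user-to-user cosine similarities that UBCF relies on can be inspected directly. The sketch below computes them for the first five users on center-normalized ratings, since (as the warning output below shows) UBCF normalizes by centering by default:

# User-user cosine similarity for the first five users, after centering each
# user's ratings (the default normalization for UBCF)
user_sim = similarity(normalize(data[1:5, ]), method = "Cosine", which = "users")
as.matrix(user_sim)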
user_user = Recommender(data = train, method = "UBCF", parameter = list(k = 30, method = "Cosine"))
## Warning: Unknown parameters: k
## Available parameter (with default values):
## method = cosine
## nn = 25
## sample = FALSE
## normalize = center
## verbose = FALSE
user_user
## Recommender of type 'UBCF' for 'realRatingMatrix'
## learned using 226 users.
As with the Item-to-Item system, a function was created that, given a user's Id number, returns a list of six recommended movies. Again taking the first user (User 1) for comparison purposes, the six recommendations are shown below:
# recommending
n_recommended = 6
predictions2 = predict(object = user_user, newdata = test,
n = n_recommended)
predictions2
## Recommendations as 'topNList' with n = 6 for 80 users.
user_pred = function(idNum){
user_x = predictions2@items[[idNum]]
user_x = predictions2@itemLabels[user_x]
return(data.frame(user_x))
}
user_pred(1)
## user_x
## 1 Shawshank Redemption, The (1994)
## 2 Memento (2000)
## 3 Pianist, The (2002)
## 4 American Beauty (1999)
## 5 Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964)
## 6 Schindler's List (1993)
The ten movies recommended most often by this recommender system are listed below, in both the ordered table and the bar chart:
recc_matrix2 = sapply(predictions2@items, function(x){
colnames(data)[x]
})
number_of_items = data.frame(table(unlist(recc_matrix2, use.names=FALSE)))
sorted = number_of_items[order(number_of_items$Freq, decreasing = T),][1:10,]
kable(sorted)
 | Var1 | Freq |
---|---|---|
23 | Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964) | 28 |
63 | Schindler’s List (1993) | 26 |
13 | Braveheart (1995) | 21 |
76 | Terminator 2: Judgment Day (1991) | 21 |
66 | Shawshank Redemption, The (1994) | 20 |
32 | Godfather, The (1972) | 19 |
44 | Lord of the Rings: The Fellowship of the Ring, The (2001) | 16 |
26 | Fight Club (1999) | 15 |
83 | Usual Suspects, The (1995) | 15 |
20 | Dances with Wolves (1990) | 14 |
ggplot(data=sorted, aes(x=Var1, y=Freq)) +
geom_bar(stat="identity", fill="steelblue")+
geom_text(aes(label=Freq), color="red", size=3)+
theme_minimal()+
xlab("Movie Titles")+
coord_flip()
Just by looking at the results of the User-to-User system, it is interesting to note that the recommendations for the same user differ between this approach and the Item-to-Item method. In addition, the User-to-User approach recommends the same movies far more frequently. This may reflect a popularity effect: widely watched movies keep being recommended simply because many similar users have rated them.
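A quick way to quantify this difference for User 1 is to check the overlap between the two top-6 lists, reusing the helper functions defined above (a small sketch; for User 1 the two lists printed earlier share no titles):

# Movies recommended to User 1 by both the item-based and user-based systems
intersect(as.character(item_pred(1)$user_x), as.character(user_pred(1)$user_x))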
In order to test the models' accuracy, the first step was to determine the minimum number of items rated by any user. The number of items given to the recommender for each test user (the "given" parameter below) must be smaller than this value.
set.seed(101)
minimum = min(rowCounts(data))
minimum
## [1] 18
An evaluation scheme was then created containing a training set and a test set (with the test set split into known and unknown items). The k-fold cross-validation approach was used, where the users are split into k chunks, each chunk is held out in turn as the test set, and accuracy is evaluated over the k runs.
evaluation = evaluationScheme(data = data, method = "cross-validation", k = 10, given = 10, goodRating = 3.5)
evaluation
## Evaluation scheme with 10 items given
## Method: 'cross-validation' with 10 run(s).
## Good ratings: >=3.500000
## Data set: 306 x 281 rating matrix of class 'realRatingMatrix' with 22003 ratings.
ev_train = getData(evaluation, "train")
ev_known = getData(evaluation, "known")
ev_unknown = getData(evaluation, "unknown")
Using this evaluation scheme, the Item-Based and User-Based Collaborative Filtering methods were compared. The table below lists the Root Mean Square Error (RMSE), Mean Squared Error (MSE), and Mean Absolute Error (MAE) to make it easier to compare each method's performance.
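For reference, the three error measures relate to each other as shown in the sketch below, which uses hypothetical predicted and actual ratings rather than the models' output:

# Illustration of MAE, MSE and RMSE on made-up values (not model output)
predicted = c(3.5, 4.0, 2.5)
actual = c(4.0, 3.5, 3.0)
err = predicted - actual
MAE = mean(abs(err))   # mean absolute error
MSE = mean(err^2)      # mean squared error
RMSE = sqrt(MSE)       # square root of the MSE, on the same scale as the ratings
c(RMSE = RMSE, MSE = MSE, MAE = MAE)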
# Item
item_model = Recommender(data = ev_train, method = "IBCF", parameter = list(method = "Cosine"))
item_model_pred = predict(object = item_model, newdata = ev_known, n = 10, type = "ratings")
item = calcPredictionAccuracy(x = item_model_pred, data = ev_unknown, byUser = FALSE)
# User
user_model = Recommender(data = ev_train, method = "UBCF", parameter = list(method = "Cosine"))
user_model_pred = predict(object = user_model, newdata = ev_known, n = 10, type = "ratings")
user = calcPredictionAccuracy(x = user_model_pred, data = ev_unknown, byUser = FALSE)
# Comparison
kable(rbind(item, user))
 | RMSE | MSE | MAE |
---|---|---|---|
item | 1.3651117 | 1.8635299 | 1.0252760 |
user | 0.9378846 | 0.8796276 | 0.7180434 |
Across all three measures (RMSE, MSE, and MAE), the user-based method scored lower and therefore performed better.
To get a closer look at the performance between the IBCF and UBCF methods, confusion matrices of the k-folds were created and are shown below.
eval_sets = evaluationScheme(data = data, method = "cross-validation", k = 4, given = 10, goodRating = 3.5)
I_results = evaluate(x = eval_sets, method = "IBCF", n = seq(10, 100, 10))
## IBCF run fold/sample [model time/prediction time]
## 1 [0.16sec/0.02sec]
## 2 [0.14sec/0.03sec]
## 3 [0.11sec/0.03sec]
## 4 [0.14sec/0.03sec]
kable(head(getConfusionMatrix(I_results)[[1]]))
n | TP | FP | FN | TN | precision | recall | TPR | FPR |
---|---|---|---|---|---|---|---|---|
10 | 1.551282 | 8.448718 | 43.79487 | 217.2051 | 0.1551282 | 0.0350662 | 0.0350662 | 0.0373916 |
20 | 3.282051 | 16.717949 | 42.06410 | 208.9359 | 0.1641026 | 0.0733280 | 0.0733280 | 0.0739931 |
30 | 5.025641 | 24.974359 | 40.32051 | 200.6795 | 0.1675214 | 0.1120181 | 0.1120181 | 0.1107101 |
40 | 6.820513 | 33.179487 | 38.52564 | 192.4744 | 0.1705128 | 0.1497783 | 0.1497783 | 0.1468460 |
50 | 8.589744 | 41.410256 | 36.75641 | 184.2436 | 0.1717949 | 0.1854553 | 0.1854553 | 0.1828947 |
60 | 10.205128 | 49.794872 | 35.14103 | 175.8590 | 0.1700855 | 0.2250039 | 0.2250039 | 0.2202029 |
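As a quick check of how the precision column relates to the averaged counts, the n = 10 row above can be recomputed by hand (a sketch; since TP + FP always equals n, the average precision is exactly mean(TP) / n, whereas recall is averaged per user, so the same ratio of averaged counts only approximates the reported value):

# Recomputing precision and recall from the averaged counts in the n = 10 row
TP = 1.551282; FP = 8.448718; FN = 43.79487
TP / (TP + FP)  # 0.1551282, matching the precision column exactly
TP / (TP + FN)  # roughly 0.034, close to the reported recall of 0.0350662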
U_results = evaluate(x = eval_sets, method = "UBCF", n = seq(10, 100, 10))
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/0.08sec]
## 2 [0sec/0.1sec]
## 3 [0.02sec/0.08sec]
## 4 [0sec/0.11sec]
kable(head(getConfusionMatrix(U_results)[[1]]))
n | TP | FP | FN | TN | precision | recall | TPR | FPR |
---|---|---|---|---|---|---|---|---|
10 | 3.192308 | 6.807692 | 42.15385 | 218.8462 | 0.3192308 | 0.0773929 | 0.0773929 | 0.0293254 |
20 | 5.666667 | 14.333333 | 39.67949 | 211.3205 | 0.2833333 | 0.1322142 | 0.1322142 | 0.0619863 |
30 | 7.858974 | 22.141026 | 37.48718 | 203.5128 | 0.2619658 | 0.1787146 | 0.1787146 | 0.0960339 |
40 | 9.615385 | 30.384615 | 35.73077 | 195.2692 | 0.2403846 | 0.2147292 | 0.2147292 | 0.1325856 |
50 | 11.294872 | 38.705128 | 34.05128 | 186.9487 | 0.2258974 | 0.2507996 | 0.2507996 | 0.1695192 |
60 | 13.038462 | 46.961539 | 32.30769 | 178.6923 | 0.2173077 | 0.2898020 | 0.2898020 | 0.2059630 |
It appears that, overall, the UBCF method identifies more true positives, while the IBCF method leaves more relevant items unrecommended (higher false negative counts). The UBCF method also has higher precision and recall than the IBCF method at every list length.
The ROC plots of the IBCF and UBCF performance are shown below for comparison.
plot(I_results, annotate = TRUE, main = "ROC curve of IBCF")
plot(U_results, annotate = TRUE, main = "ROC curve of UBCF")
Just from looking at the ROC curves, the two methods appear to perform similarly. However, the UBCF approach reaches a higher TPR and curves upward between the n = 10 and n = 40 marks, which gives it an edge over the IBCF method.
Beyond the IBCF-by-cosine and UBCF-by-cosine approaches, other algorithms can be used to measure the distance between users or items and to make recommendations. The Pearson correlation can be used as the distance measure instead of cosine similarity; recommendations can be made at random; and recommendations can be made based on which items are most popular. Each of these approaches (IBCF by Pearson, UBCF by Pearson, Random, and Popular) can be evaluated and compared against the IBCF-by-cosine and UBCF-by-cosine results already measured.
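As a small illustration of the alternative distance measure, the Pearson correlations between the first five users can be computed the same way as the cosine similarities earlier (a sketch only; the fitted models below use the full evaluation data):

# User-user Pearson correlation for the first five users (illustrative only)
as.matrix(similarity(data[1:5, ], method = "pearson", which = "users"))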
mult_models = list(
IBCF_cos = list(name = "IBCF", param = list(method = "Cosine")),
IBCF_pearson = list(name = "IBCF", param = list(method = "pearson")),
UBCF_cos = list(name = "UBCF", param = list(method = "Cosine")),
UBCF_pearson = list(name = "UBCF", param = list(method = "pearson")),
Random = list(name = "RANDOM", param = NULL),
Popular = list(name = "POPULAR", param = NULL)
)
# Testing models
models = evaluate(eval_sets, mult_models, n= c(1, 5, seq(10, 100, 10)))
## IBCF run fold/sample [model time/prediction time]
## 1 [0.15sec/0.02sec]
## 2 [0.19sec/0.04sec]
## 3 [0.16sec/0.03sec]
## 4 [0.15sec/0.01sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [0.29sec/0.02sec]
## 2 [0.16sec/0.03sec]
## 3 [0.15sec/0.03sec]
## 4 [0.16sec/0.03sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/0.08sec]
## 2 [0sec/0.09sec]
## 3 [0sec/0.1sec]
## 4 [0sec/0.09sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/0.11sec]
## 2 [0sec/0.21sec]
## 3 [0sec/0.09sec]
## 4 [0sec/0.08sec]
## RANDOM run fold/sample [model time/prediction time]
## 1 [0sec/0.03sec]
## 2 [0.01sec/0.05sec]
## 3 [0sec/0.04sec]
## 4 [0sec/0.05sec]
## POPULAR run fold/sample [model time/prediction time]
## 1 [0sec/0.09sec]
## 2 [0.03sec/0.14sec]
## 3 [0sec/0.12sec]
## 4 [0.01sec/0.1sec]
The ROC and Precision/Recall plots below show the performance of each of the models.
# Plotting models
plot(models, annotate = T, legend="topleft")
plot(models, "prec/rec", annotate = F, main="Precision/Recall", legend="topright")
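To back up the plots with numbers, the averaged confusion matrix behind any of the curves can be extracted with avg(); the sketch below does this for the best-performing 'Popular' model:

# Averaged (across folds) confusion matrix for the 'Popular' model
kable(head(avg(models[["Popular"]])))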
In both plots, the "Popular" model was, surprisingly, the one with the best performance. For both the IBCF and UBCF methods, the Pearson correlation appeared to be the better choice of distance measure. In addition, the UBCF approach performed better than the IBCF approach regardless of the measure (Pearson or cosine). The random approach did not perform well, which was expected; what is interesting, seeing the results side by side, is that IBCF with cosine scored just as low.
The best approach for this downsized MovieLens dataset was recommending Popular movies. That said, this may be a by-product of the downsizing method, since the most-reviewed movies were the ones retained for evaluation. If that is indeed the case, the next best approach to consider is the User-to-User Collaborative Filtering algorithm (using the Pearson correlation as the distance measure between users). Comparing RMSE, MSE, and MAE values, the UBCF method performed better on all counts, and this was reinforced by the ROC plots in the "Comparing Different Model Approaches" section, where the IBCF results were almost as low as selecting movies at random. The downsizing was done the way it was primarily to guarantee a large enough pool of entries in a rather sparse matrix; it will be interesting to see whether these results hold up in a setting where users and movies were not specifically selected for their high rating activity.