Some comments about the dataset:
library(recommenderlab)
library(tidyverse)
library(kableExtra)
Import files
The files are too large to upload to my github even when zipped so you can find it here.
d_ratings <- read.csv("C:/Users/Javern/Documents/Data Science MS/DATA612/libimseti-complete/libimseti/ratings.dat", header = F, sep = ",")
colnames(d_ratings) <- c("userId", "profileId", "rating")
d_ratings$userId <- as.factor(d_ratings$userId)
d_ratings$profileId <- as.factor(d_ratings$profileId)
gender <- read.csv("C:/Users/Javern/Documents/Data Science MS/DATA612/libimseti-complete/libimseti/gender.dat", header = F, sep = ",")
colnames(gender) <- c("userId", "type")
gender$userId <- as.factor(gender$userId)
head(d_ratings); head(gender)
d_ratings %>%
group_by(userId)%>%
summarise(user_count = length(userId)) %>%
top_n(10) %>%
ggplot(aes(userId, user_count, fill = userId)) + geom_bar(stat = "identity") + geom_text(aes(label=user_count), vjust=-0.3, size=3.5)
package <U+393C><U+3E31>bindrcpp<U+393C><U+3E32> was built under R version 3.5.2Selecting by user_count
d_ratings %>%
group_by(profileId) %>%
summarise(profile_count = length(profileId)) %>%
top_n(10) -> topprofiles
Selecting by profile_count
ggplot(topprofiles, aes(profileId, profile_count, fill = profileId)) +
geom_bar(stat = "identity") +
geom_text(aes(label=profile_count), vjust=-0.3, size=3.5)
d_ratings %>%
group_by(rating)%>%
summarise(rating_count = sum(rating)) %>%
ggplot(aes(rating, rating_count, fill = rating)) + geom_bar(stat = "identity") + scale_x_continuous(breaks = seq(1, 10)) + geom_text(aes(label=rating_count), vjust=-0.3, size=3.5)
gender %>%
group_by(type) %>%
summarise(type_count = length(type)) %>%
ggplot(aes(type, type_count, fill = type)) + geom_bar(stat = "identity", color = "purple")
NA
As we move forward in the analysis, the ratings dataframe will be converted into a matrix to build and evaluate the recommendation systems.
dmatrix <- as(d_ratings, "realRatingMatrix")
dmatrix
135359 x 168791 rating matrix of class realRatingMatrix with 17359346 ratings.
Dimensions of the matrix
dim(dmatrix@data)
[1] 135359 168791
Size of matrix data
object.size(dmatrix)
228455328 bytes
About 228 MB
The file is large so we’ll cut down on the number of attributes.
# users who rated at least 500 profiles
# profiles that are rated at least 1000 times
dmatrix <- dmatrix[rowCounts(dmatrix) > 500, colCounts(dmatrix) > 800]
dmatrix
3894 x 3391 rating matrix of class realRatingMatrix with 1134236 ratings.
avg_profile_ratings <- data.frame("avg_rating" = colMeans(dmatrix)) %>%
ggplot(aes(x = avg_rating)) +
geom_histogram(color = "red", fill = "lightblue") +
ggtitle("Distribution of Average Ratings for Profiles")
avg_profile_ratings
The distribution is nearly normal with most rating falling between 5 and 7.5.
sim <- similarity(dmatrix[1:300, ], method = "cosine", which = "users")
image(as.matrix(sim), main = "User Similarity")
sim2 <- similarity(dmatrix[, 1:300], method = "cosine", which = "items")
image(as.matrix(sim2), main = "Profile Similarity")
So we are going to split the data 90:10, train and test respectively, keeping 3 items and running the evaluation 4 times.
#min(rowCounts(dmatrix))= 6 so we can keep 5 items per user
dmat_eval <- evaluationScheme(data = dmatrix, method = "split", train = 0.9, given = 5, goodRating = 5, k = 4)
dmat_eval
Evaluation scheme with 5 items given
Method: split with 4 run(s).
Training set proportion: 0.900
Good ratings: >=5.000000
Data set: 3894 x 3391 rating matrix of class realRatingMatrix with 1134236 ratings.
algorithms <- list(
IBCF = list(name = "IBCF", param = list(method = "cosine")),
UBCF = list(name = "UBCF", param = list(method = "cosine")),
SVD = list(name = "SVD", param = list(k = 30)),
POPULAR = list(name = "POPULAR", param = NULL), #serendipity
RANDOM = list(name = "RANDOM", param = NULL)
)
Test the models by varying the number of profiles to recommend.
# run algorithms, predict next n profile
eval_results <- evaluate(dmat_eval, algorithms, type = "topNList", n = c(1, 3, 5, 10, 15, 20))
IBCF run fold/sample [model time/prediction time]
1 [497.85sec/0.43sec]
2 [493.34sec/0.41sec]
3 [492.62sec/0.29sec]
4 [495.82sec/0.2sec]
UBCF run fold/sample [model time/prediction time]
1 [0.17sec/41.13sec]
2 [0.17sec/42.32sec]
3 [0.16sec/39.14sec]
4 [0.23sec/40.22sec]
SVD run fold/sample [model time/prediction time]
1 [3.33sec/1.24sec]
2 [3.31sec/1.13sec]
3 [3.58sec/1.27sec]
4 [3.45sec/1.17sec]
POPULAR run fold/sample [model time/prediction time]
1 [0.19sec/4.05sec]
2 [0.19sec/4.03sec]
3 [0.22sec/3.88sec]
4 [0.2sec/3.89sec]
RANDOM run fold/sample [model time/prediction time]
1 [0sec/1.28sec]
2 [0sec/1.11sec]
3 [0sec/1.28sec]
4 [0sec/1.28sec]
averages <- avg(eval_results)
IBCF took approximately 25 minutes to run and do prediction.
TP - True Positive FP - False Positive FN - False Negative TN - True Negative
kable(averages$IBCF) %>% kable_styling(bootstrap_options = c("striped", "bordered"), full_width = F, font_size = 11) %>% add_header_above(c(" ", "IBCF" = 8))
TP | FP | FN | TN | precision | recall | TPR | FPR | |
---|---|---|---|---|---|---|---|---|
1 | 0.0179487 | 0.975641 | 222.5199 | 3162.487 | 0.0180648 | 0.0000925 | 0.0000925 | 0.0003090 |
3 | 0.0596154 | 2.882051 | 222.4782 | 3160.580 | 0.0200050 | 0.0002740 | 0.0002740 | 0.0009123 |
5 | 0.0967949 | 4.723718 | 222.4410 | 3158.738 | 0.0195725 | 0.0004463 | 0.0004463 | 0.0014955 |
10 | 0.1794872 | 9.063462 | 222.3583 | 3154.399 | 0.0184412 | 0.0007813 | 0.0007813 | 0.0028689 |
15 | 0.2717949 | 13.057692 | 222.2660 | 3150.404 | 0.0190787 | 0.0011761 | 0.0011761 | 0.0041339 |
20 | 0.3391026 | 16.832051 | 222.1987 | 3146.630 | 0.0182538 | 0.0014670 | 0.0014670 | 0.0053300 |
kable(averages$UBCF) %>% kable_styling(bootstrap_options = c("striped", "bordered"), full_width = F, font_size = 11) %>% add_header_above(c(" ", "UBCF" = 8))
TP | FP | FN | TN | precision | recall | TPR | FPR | |
---|---|---|---|---|---|---|---|---|
1 | 0.449359 | 0.5487179 | 222.0885 | 3162.913 | 0.4502025 | 0.0024331 | 0.0024331 | 0.0001721 |
3 | 1.327564 | 1.6666667 | 221.2103 | 3161.796 | 0.4433765 | 0.0070320 | 0.0070320 | 0.0005221 |
5 | 2.157692 | 2.8326923 | 220.3801 | 3160.629 | 0.4323913 | 0.0113597 | 0.0113597 | 0.0008872 |
10 | 4.025000 | 5.9557692 | 218.5128 | 3157.506 | 0.4033029 | 0.0208295 | 0.0208295 | 0.0018655 |
15 | 5.716667 | 9.2544872 | 216.8212 | 3154.208 | 0.3818708 | 0.0289989 | 0.0289989 | 0.0028990 |
20 | 7.313461 | 12.6480769 | 215.2244 | 3150.814 | 0.3664033 | 0.0366339 | 0.0366339 | 0.0039625 |
kable(averages$SVD) %>% kable_styling(bootstrap_options = c("striped", "bordered"), full_width = F, font_size = 11) %>% add_header_above(c(" ", "SVD" = 8))
TP | FP | FN | TN | precision | recall | TPR | FPR | |
---|---|---|---|---|---|---|---|---|
1 | 0.2500000 | 0.750000 | 222.2878 | 3162.712 | 0.2500000 | 0.0014511 | 0.0014511 | 0.0002370 |
3 | 0.7108974 | 2.289103 | 221.8269 | 3161.173 | 0.2369658 | 0.0039949 | 0.0039949 | 0.0007227 |
5 | 1.1538462 | 3.846154 | 221.3840 | 3159.616 | 0.2307692 | 0.0064196 | 0.0064196 | 0.0012140 |
10 | 2.2198718 | 7.780128 | 220.3179 | 3155.682 | 0.2219872 | 0.0124932 | 0.0124932 | 0.0024563 |
15 | 3.2980769 | 11.701923 | 219.2397 | 3151.760 | 0.2198718 | 0.0183008 | 0.0183008 | 0.0036935 |
20 | 4.3525641 | 15.647436 | 218.1853 | 3147.815 | 0.2176282 | 0.0239089 | 0.0239089 | 0.0049379 |
kable(averages$POPULAR) %>% kable_styling(bootstrap_options = c("striped", "bordered"), full_width = F, font_size = 11) %>% add_header_above(c(" ", "POPULAR" = 8))
TP | FP | FN | TN | precision | recall | TPR | FPR | |
---|---|---|---|---|---|---|---|---|
1 | 0.4057692 | 0.5942308 | 222.1321 | 3162.868 | 0.4057692 | 0.0018452 | 0.0018452 | 0.0001856 |
3 | 1.2275641 | 1.7724359 | 221.3103 | 3161.690 | 0.4091880 | 0.0058259 | 0.0058259 | 0.0005546 |
5 | 2.0128205 | 2.9871795 | 220.5250 | 3160.475 | 0.4025641 | 0.0092910 | 0.0092910 | 0.0009331 |
10 | 3.7782051 | 6.2217949 | 218.7596 | 3157.240 | 0.3778205 | 0.0175165 | 0.0175165 | 0.0019462 |
15 | 5.4551282 | 9.5448718 | 217.0827 | 3153.917 | 0.3636752 | 0.0253210 | 0.0253210 | 0.0029850 |
20 | 7.1230769 | 12.8769231 | 215.4147 | 3150.585 | 0.3561538 | 0.0322594 | 0.0322594 | 0.0040242 |
kable(averages$RANDOM) %>% kable_styling(bootstrap_options = c("striped", "bordered"), full_width = F, font_size = 11) %>% add_header_above(c(" ", "RANDOM" = 8))
TP | FP | FN | TN | precision | recall | TPR | FPR | |
---|---|---|---|---|---|---|---|---|
1 | 0.0692308 | 0.9307692 | 222.4686 | 3162.531 | 0.0692308 | 0.0003498 | 0.0003498 | 0.0002944 |
3 | 0.2198718 | 2.7801282 | 222.3179 | 3160.682 | 0.0732906 | 0.0010129 | 0.0010129 | 0.0008787 |
5 | 0.3660256 | 4.6339744 | 222.1718 | 3158.828 | 0.0732051 | 0.0016633 | 0.0016633 | 0.0014648 |
10 | 0.7179487 | 9.2820513 | 221.8199 | 3154.180 | 0.0717949 | 0.0033660 | 0.0033660 | 0.0029348 |
15 | 1.0525641 | 13.9474359 | 221.4853 | 3149.515 | 0.0701709 | 0.0049277 | 0.0049277 | 0.0044104 |
20 | 1.4205128 | 18.5794872 | 221.1173 | 3144.883 | 0.0710256 | 0.0066586 | 0.0066586 | 0.0058743 |
The ROC curve is created by plotting the true positive rate against the false positive rate. The closer an ROC curve is to the upper left corner, the more efficient is the test.
plot(eval_results, annotate = T, legend="topleft")
title("ROC Curve")
Based on the graph visualization above, the UBCF is better than the others.
Precision expresses the proportion of the data points our model says was relevant and are actually were relevant. Calculated as: TP / (TP + FP)
Recall expresses the ability to find all relevant instances in a dataset or the model’s ability to find all the data points of interest in a dataset. TP / (TP + FN)
The closer the curve or line is to the top right, the better the performance of the algorithm.
# precision / recall
plot(eval_results, "prec/rec", annotate = 2)
title("Precision-recall")
User Based model is still better than the other algorithms.
#Predict top-N recommendation lists
eval_results2 <- evaluate(dmat_eval, algorithms, type = "ratings")
IBCF run fold/sample [model time/prediction time]
1 [493.46sec/0.12sec]
2 [505.67sec/0.14sec]
3 [500.72sec/0.12sec]
4 [499.53sec/0.14sec]
UBCF run fold/sample [model time/prediction time]
1 [0.17sec/41.48sec]
2 [0.17sec/40.88sec]
3 [0.19sec/39.89sec]
4 [0.17sec/41.02sec]
SVD run fold/sample [model time/prediction time]
1 [3.41sec/0.93sec]
2 [3.33sec/0.74sec]
3 [3.77sec/0.7sec]
4 [3.37sec/0.63sec]
POPULAR run fold/sample [model time/prediction time]
1 [0.2sec/0.44sec]
2 [0.35sec/0.36sec]
3 [0.19sec/0.44sec]
4 [0.19sec/0.41sec]
RANDOM run fold/sample [model time/prediction time]
1 [0.01sec/0.74sec]
2 [0sec/0.87sec]
3 [0sec/0.88sec]
4 [0.01sec/0.79sec]
avg(eval_results2)
$`IBCF`
RMSE MSE MAE
res 2.487833 6.196339 1.179454
$UBCF
RMSE MSE MAE
res 2.784005 7.75109 2.235006
$SVD
RMSE MSE MAE
res 2.986838 8.92139 2.420337
$POPULAR
RMSE MSE MAE
res 2.187636 4.786568 1.673285
$RANDOM
RMSE MSE MAE
res 3.707756 13.7477 2.933583
As noted, offline evaluations use precompiled offline datasets from which data was removed hence evaluations are used to analyze the algorthims’ ablity to predict missing data. On the other hand, in online evaluations, recommendations are shown to real users of the system during their session and so users do not rate recommendations but the recommender system observes how often a user accepts a recommendation. One metric that could be used to evaluate online evaluation is click-through rate (CTR). This measures the ratio of clicks to the number of recommendation lists provided. So for instance if the recommender recommends 1000 profiles and the user clicks only 5 then the CTR would be 0.5%.
Based on the output shown for each algorithms, the IBCF took a longer time to learn the data than the others but took a shorter time to predict the data. According to the error rates, Popular has the lowest RMSE which means that it performed the best with lower error rates. However, as shown in the visualizations, UBCF performed best. The Random algorithm continued to perform worst through out the process.
https://towardsdatascience.com/beyond-accuracy-precision-and-recall-3da06bea9f6c https://acutecaretesting.org/en/articles/roc-curves-what-are-they-and-how-are-they-used https://pdfs.semanticscholar.org/94c2/00cec1e2f9547ea6063e08019f72895bfba8.pdf https://link.springer.com/article/10.1007/s13042-017-0762-9