Presented By: Abinaya, Badrinath, Rajkumar
Date: 31st Jan 2016
Name of the Data Set: MovieLens 100k Dataset
INTRODUCTION:
MovieLens data sets were collected by the GroupLens Research Project at the University of Minnesota.
This data set consists of:
1. 100,000 ratings (1-5) from 943 users on 1682 movies.
2. Each user has rated at least 20 movies.
3. Simple demographic info for the users (age, gender, occupation, zip)
The data was collected through the MovieLens web site (movielens.umn.edu) during the seven-month period from September 19th, 1997 through April 22nd, 1998. The data has been cleaned up: users who had fewer than 20 ratings or did not have complete demographic information were removed from this data set.
REFERENCE:
Herlocker, J., Konstan, J., Borchers, A., Riedl, J. An Algorithmic Framework for Performing Collaborative Filtering. Proceedings of the 1999 Conference on Research and Development in Information Retrieval. Aug. 1999.
SUMMARY OF THE DATASET
Number of users who rated: 943
Number of films rated: 1664
Total ratings: 99392
Creating Recommender Models: Let's look at the recommender methods available for real-valued rating data
$IBCF_realRatingMatrix
Recommender method: IBCF
Description: Recommender based on item-based collaborative filtering (real data).
Parameters:
   k method normalize normalize_sim_matrix alpha na_as_zero minRating
1 30 Cosine    center                FALSE   0.5      FALSE        NA
$PCA_realRatingMatrix
Recommender method: PCA
Description: Recommender based on PCA approximation (real data).
Parameters:
  categories method normalize normalize_sim_matrix alpha na_as_zero minRating
1         20 Cosine    center                FALSE   0.5      FALSE        NA
$POPULAR_realRatingMatrix
Recommender method: POPULAR
Description: Recommender based on item popularity (real data).
Parameters: None
$RANDOM_realRatingMatrix
Recommender method: RANDOM
Description: Produce random recommendations (real ratings).
Parameters: None
$SVD_realRatingMatrix
Recommender method: SVD
Description: Recommender based on SVD approximation (real data).
Parameters:
  categories method normalize normalize_sim_matrix alpha treat_na minRating
1         50 Cosine    center                FALSE   0.5   median        NA
$UBCF_realRatingMatrix
Recommender method: UBCF
Description: Recommender based on user-based collaborative filtering (real data).
Parameters:
  method nn sample normalize minRating
1 cosine 25  FALSE    center        NA
Method #1: “POPULAR” recommender model trained on 500 users
Recommender of type 'POPULAR' for 'realRatingMatrix'
learned using 500 users.
[1] "topN" "ratings" "normalize" "aggregation"
The fitted POPULAR model has the four components listed above; let's try out some of them
Model #1: topN
Recommendations as 'topNList' with n = 1664 for 1 users.
Top-10 film recommendations for 10 other users, based on our recommender model
Recommendations as 'topNList' with n = 10 for 10 users.
[[1]]
[1] "Star Wars (1977)" "Silence of the Lambs, The (1991)"
[3] "Raiders of the Lost Ark (1981)" "Pulp Fiction (1994)"
[5] "Shawshank Redemption, The (1994)" "Schindler's List (1993)"
[7] "Usual Suspects, The (1995)" "Empire Strikes Back, The (1980)"
[9] "Casablanca (1942)" "L.A. Confidential (1997)"
[[2]]
[1] "Star Wars (1977)" "Fargo (1996)"
[3] "Silence of the Lambs, The (1991)" "Godfather, The (1972)"
[5] "Raiders of the Lost Ark (1981)" "Pulp Fiction (1994)"
[7] "Shawshank Redemption, The (1994)" "Schindler's List (1993)"
[9] "Usual Suspects, The (1995)" "Empire Strikes Back, The (1980)"
[[3]]
[1] "Pulp Fiction (1994)"
[2] "Shawshank Redemption, The (1994)"
[3] "Casablanca (1942)"
[4] "L.A. Confidential (1997)"
[5] "One Flew Over the Cuckoo's Nest (1975)"
[6] "Braveheart (1995)"
[7] "Amadeus (1984)"
[8] "Good Will Hunting (1997)"
[9] "Blade Runner (1982)"
[10] "Contact (1997)"
[[4]]
[1] "Shawshank Redemption, The (1994)" "Titanic (1997)"
[3] "Usual Suspects, The (1995)" "Empire Strikes Back, The (1980)"
[5] "Casablanca (1942)" "L.A. Confidential (1997)"
[7] "Braveheart (1995)" "Princess Bride, The (1987)"
[9] "Amadeus (1984)" "Good Will Hunting (1997)"
[[5]]
[1] "Fargo (1996)"
[2] "Shawshank Redemption, The (1994)"
[3] "Schindler's List (1993)"
[4] "Usual Suspects, The (1995)"
[5] "Casablanca (1942)"
[6] "L.A. Confidential (1997)"
[7] "One Flew Over the Cuckoo's Nest (1975)"
[8] "Good Will Hunting (1997)"
[9] "Monty Python and the Holy Grail (1974)"
[10] "Rear Window (1954)"
[[6]]
[1] "Fargo (1996)"
[2] "Silence of the Lambs, The (1991)"
[3] "Godfather, The (1972)"
[4] "Shawshank Redemption, The (1994)"
[5] "Schindler's List (1993)"
[6] "Titanic (1997)"
[7] "Casablanca (1942)"
[8] "L.A. Confidential (1997)"
[9] "One Flew Over the Cuckoo's Nest (1975)"
[10] "Braveheart (1995)"
[[7]]
[1] "Fargo (1996)" "Silence of the Lambs, The (1991)"
[3] "Godfather, The (1972)" "Raiders of the Lost Ark (1981)"
[5] "Pulp Fiction (1994)" "Shawshank Redemption, The (1994)"
[7] "Schindler's List (1993)" "Usual Suspects, The (1995)"
[9] "Empire Strikes Back, The (1980)" "Casablanca (1942)"
[[8]]
[1] "Fargo (1996)" "Godfather, The (1972)"
[3] "Pulp Fiction (1994)" "Shawshank Redemption, The (1994)"
[5] "Titanic (1997)" "Usual Suspects, The (1995)"
[7] "Casablanca (1942)" "L.A. Confidential (1997)"
[9] "Braveheart (1995)" "Good Will Hunting (1997)"
[[9]]
[1] "Fargo (1996)" "Silence of the Lambs, The (1991)"
[3] "Godfather, The (1972)" "Raiders of the Lost Ark (1981)"
[5] "Pulp Fiction (1994)" "Shawshank Redemption, The (1994)"
[7] "Schindler's List (1993)" "Titanic (1997)"
[9] "Usual Suspects, The (1995)" "Empire Strikes Back, The (1980)"
[[10]]
[1] "Star Wars (1977)" "Fargo (1996)"
[3] "Silence of the Lambs, The (1991)" "Godfather, The (1972)"
[5] "Raiders of the Lost Ark (1981)" "Pulp Fiction (1994)"
[7] "Shawshank Redemption, The (1994)" "Schindler's List (1993)"
[9] "Usual Suspects, The (1995)" "Empire Strikes Back, The (1980)"
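The lists above share most titles but differ slightly from user to user. A popularity-based recommender would behave this way if it ranked all films once and then dropped the films each user had already rated. The following is a minimal Python sketch of that idea, not the code behind this report. Measuring popularity by the number of ratings is an assumption (the package used here may score popularity differently), and the function name and toy data are hypothetical.

```python
from collections import Counter

def popular_top_n(train, known, n):
    """Return the n most-rated items that the user has not rated yet.

    train: {user: {item: rating}} used to measure popularity (assumed: rating count).
    known: set of items the target user has already rated.
    """
    counts = Counter(item for ratings in train.values() for item in ratings)
    # Sort by descending count; break ties alphabetically so the result is stable.
    ranking = [item for item, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]
    return [item for item in ranking if item not in known][:n]

# Toy data (hypothetical, not taken from MovieLens).
train = {
    "u1": {"Star Wars": 5, "Fargo": 4, "Casablanca": 5},
    "u2": {"Star Wars": 4, "Fargo": 5},
    "u3": {"Star Wars": 5, "Titanic": 3},
}
top_a = popular_top_n(train, known={"Star Wars"}, n=2)  # user who already rated Star Wars
top_b = popular_top_n(train, known=set(), n=2)          # user with no ratings yet
print(top_a, top_b)
```

The user who already rated the most popular film gets the next films in the ranking instead, which matches the pattern in the lists above.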
Recommendations as ‘topNList’ with the best 5 films for the same 10 users, based on the above prediction
Recommendations as 'topNList' with n = 5 for 10 users.
[[1]]
[1] "Star Wars (1977)" "Silence of the Lambs, The (1991)"
[3] "Raiders of the Lost Ark (1981)" "Pulp Fiction (1994)"
[5] "Shawshank Redemption, The (1994)"
[[2]]
[1] "Star Wars (1977)" "Fargo (1996)"
[3] "Silence of the Lambs, The (1991)" "Godfather, The (1972)"
[5] "Raiders of the Lost Ark (1981)"
[[3]]
[1] "Pulp Fiction (1994)"
[2] "Shawshank Redemption, The (1994)"
[3] "Casablanca (1942)"
[4] "L.A. Confidential (1997)"
[5] "One Flew Over the Cuckoo's Nest (1975)"
[[4]]
[1] "Shawshank Redemption, The (1994)" "Titanic (1997)"
[3] "Usual Suspects, The (1995)" "Empire Strikes Back, The (1980)"
[5] "Casablanca (1942)"
[[5]]
[1] "Fargo (1996)" "Shawshank Redemption, The (1994)"
[3] "Schindler's List (1993)" "Usual Suspects, The (1995)"
[5] "Casablanca (1942)"
[[6]]
[1] "Fargo (1996)" "Silence of the Lambs, The (1991)"
[3] "Godfather, The (1972)" "Shawshank Redemption, The (1994)"
[5] "Schindler's List (1993)"
[[7]]
[1] "Fargo (1996)" "Silence of the Lambs, The (1991)"
[3] "Godfather, The (1972)" "Raiders of the Lost Ark (1981)"
[5] "Pulp Fiction (1994)"
[[8]]
[1] "Fargo (1996)" "Godfather, The (1972)"
[3] "Pulp Fiction (1994)" "Shawshank Redemption, The (1994)"
[5] "Titanic (1997)"
[[9]]
[1] "Fargo (1996)" "Silence of the Lambs, The (1991)"
[3] "Godfather, The (1972)" "Raiders of the Lost Ark (1981)"
[5] "Pulp Fiction (1994)"
[[10]]
[1] "Star Wars (1977)" "Fargo (1996)"
[3] "Silence of the Lambs, The (1991)" "Godfather, The (1972)"
[5] "Raiders of the Lost Ark (1981)"
Model #2: ratings. We use our model to predict film ratings for the next 10 users; the first 10 films are shown
10 x 1664 rating matrix of class 'realRatingMatrix' with 14963 ratings.
Toy Story (1995) GoldenEye (1995) Four Rooms (1995) Get Shorty (1995)
501 0.2732034 -0.3586763 -0.3781417 0.06586568
502 0.2732034 -0.3586763 -0.3781417 0.06586568
503 NA -0.3586763 -0.3781417 0.06586568
504 0.2732034 -0.3586763 -0.3781417 NA
505 NA -0.3586763 -0.3781417 0.06586568
506 0.2732034 NA -0.3781417 0.06586568
507 0.2732034 -0.3586763 -0.3781417 0.06586568
508 NA -0.3586763 -0.3781417 0.06586568
509 0.2732034 -0.3586763 -0.3781417 0.06586568
510 0.2732034 -0.3586763 -0.3781417 0.06586568
Copycat (1995) Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)
501 -0.2864243 0.2475417
502 -0.2864243 0.2475417
503 -0.2864243 0.2475417
504 NA 0.2475417
505 -0.2864243 0.2475417
506 NA 0.2475417
507 -0.2864243 0.2475417
508 -0.2864243 0.2475417
509 -0.2864243 0.2475417
510 -0.2864243 0.2475417
Twelve Monkeys (1995) Babe (1995) Dead Man Walking (1995)
501 NA 0.4471136 0.394184
502 0.1980479 0.4471136 0.394184
503 0.1980479 NA 0.394184
504 0.1980479 0.4471136 NA
505 NA 0.4471136 0.394184
506 0.1980479 NA 0.394184
507 0.1980479 0.4471136 0.394184
508 0.1980479 0.4471136 0.394184
509 0.1980479 0.4471136 0.394184
510 0.1980479 0.4471136 0.394184
Richard III (1995)
501 0.3051623
502 0.3051623
503 NA
504 0.3051623
505 0.3051623
506 NA
507 0.3051623
508 0.3051623
509 0.3051623
510 0.3051623
Let's evaluate our model using the “split” method
Evaluation scheme with 15 items given
Method: 'split' with 1 run(s).
Training set proportion: 0.750
Good ratings: >=4.000000
Data set: 500 x 1664 rating matrix of class 'realRatingMatrix' with 56431 ratings.
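The scheme printed above can be read as follows: 75% of the users go to training, and for each remaining test user 15 ratings are revealed to the recommender while the rest are withheld for scoring. Below is a rough Python sketch of such a split under those assumptions. The function name, the seed, and the synthetic ratings are hypothetical and stand in for the real data.

```python
import random

def split_scheme(ratings, train=0.75, given=15, seed=0):
    """Split users into train/test and split each test user's ratings.

    Returns (train_users, known, unknown) where known[u] holds the `given`
    ratings shown to the recommender and unknown[u] holds the withheld ones.
    """
    rng = random.Random(seed)
    users = sorted(ratings)
    rng.shuffle(users)
    n_train = int(round(train * len(users)))
    train_users, test_users = users[:n_train], users[n_train:]
    known, unknown = {}, {}
    for u in test_users:
        items = sorted(ratings[u])
        rng.shuffle(items)
        known[u] = {i: ratings[u][i] for i in items[:given]}
        unknown[u] = {i: ratings[u][i] for i in items[given:]}
    return train_users, known, unknown

# Synthetic stand-in: 500 users with 20 ratings each (not MovieLens data).
ratings = {u: {f"film{j}": 1 + (u + j) % 5 for j in range(20)} for u in range(500)}
train_users, known, unknown = split_scheme(ratings)
print(len(train_users), len(known))
```

With 500 users this gives 375 training users and 125 test users, the same sizes that appear in the outputs of this section.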
Let's create two recommenders, user-based and item-based collaborative filtering, using the above training data
Method #2
USER-BASED COLLABORATIVE FILTERING (UBCF)
Recommender of type 'UBCF' for 'realRatingMatrix'
learned using 375 users.
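As a rough illustration of what user-based collaborative filtering does, the Python sketch below finds the training users most similar to an active user by cosine similarity and averages their ratings for the target film, weighted by similarity. It is a simplified sketch under stated assumptions: it skips the rating normalization listed in the parameters, computes similarity only over co-rated films, and uses hypothetical names and toy data.

```python
import math

def cosine(a, b):
    """Cosine similarity of two {item: rating} dicts over their co-rated items."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def ubcf_predict(train, active, item, nn=2):
    """Similarity-weighted mean rating of `item` among the nn most similar users."""
    scored = sorted(
        ((cosine(active, r), r[item]) for r in train.values() if item in r),
        key=lambda pair: pair[0],
        reverse=True,
    )[:nn]
    total = sum(sim for sim, _ in scored)
    if total == 0:
        return None  # no usable neighbour
    return sum(sim * rating for sim, rating in scored) / total

# Toy data (hypothetical): u1 and u2 rate like the active user, u3 does not.
train = {
    "u1": {"A": 5, "B": 4, "C": 5},
    "u2": {"A": 5, "B": 5, "C": 4},
    "u3": {"A": 1, "B": 2, "C": 1},
}
active = {"A": 5, "B": 4}
pred_nn2 = ubcf_predict(train, active, "C", nn=2)
pred_nn3 = ubcf_predict(train, active, "C", nn=3)
print(pred_nn2, pred_nn3)
```

Widening the neighbourhood from 2 to 3 pulls in the dissimilar user and lowers the prediction, which is why the neighbourhood size (nn = 25 in the parameter listing) matters.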
Method #3
ITEM-BASED COLLABORATIVE FILTERING (IBCF)
Recommender of type 'IBCF' for 'realRatingMatrix'
learned using 375 users.
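Item-based collaborative filtering turns the comparison around: it measures how similar films are to each other and predicts a rating from the active user's own ratings of the most similar films. The Python sketch below is a simplified illustration under assumptions. It uses cosine similarity over users who rated both films and no normalization, and all names and data are hypothetical.

```python
import math

def item_vectors(train):
    """Turn {user: {item: rating}} into {item: {user: rating}}."""
    vectors = {}
    for user, ratings in train.items():
        for item, rating in ratings.items():
            vectors.setdefault(item, {})[user] = rating
    return vectors

def cosine(a, b):
    """Cosine similarity of two {user: rating} dicts over shared users."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[u] * b[u] for u in common)
    norm_a = math.sqrt(sum(a[u] ** 2 for u in common))
    norm_b = math.sqrt(sum(b[u] ** 2 for u in common))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def ibcf_predict(train, active, item, k=2):
    """Similarity-weighted mean of the active user's ratings on the k most similar items."""
    vectors = item_vectors(train)
    target = vectors.get(item, {})
    scored = sorted(
        ((cosine(target, vectors[j]), r) for j, r in active.items() if j in vectors),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in scored)
    if total == 0:
        return None
    return sum(sim * rating for sim, rating in scored) / total

# Toy data (hypothetical): films A and B are rated alike, film C is rated the opposite way.
train = {
    "u1": {"A": 5, "B": 5, "C": 1},
    "u2": {"A": 4, "B": 5, "C": 2},
    "u3": {"A": 1, "B": 2, "C": 5},
}
active = {"A": 5, "C": 1}
pred_k1 = ibcf_predict(train, active, "B", k=1)
pred_k2 = ibcf_predict(train, active, "B", k=2)
print(pred_k1, pred_k2)
```

With k = 1 only the most similar film (A) is used, so the prediction equals the active user's rating of A.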
Predict ratings from the known part of the test data using the above two algorithms
Model #1 (ratings): predictions from R1 “UBCF”, given 15 items for each test user (evaluation)
125 x 1664 rating matrix of class 'realRatingMatrix' with 206125 ratings.
Model #2 (ratings): predictions from R2 “IBCF”, given 15 items for each test user (evaluation)
125 x 1664 rating matrix of class 'realRatingMatrix' with 35735 ratings.
Computing the prediction errors of the above two algorithms
         RMSE      MSE       MAE
UBCF 1.096151 1.201546 0.8800528
IBCF 1.203950 1.449496 0.8558741
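For reference, the three error measures in the table are related as sketched below in Python: MSE is the mean squared difference between predicted and actual ratings, RMSE is its square root, and MAE is the mean absolute difference. The function name and toy numbers are hypothetical. The last two lines only check that the reported RMSE and MSE values are consistent with each other.

```python
import math

def error_metrics(predicted, actual):
    """Compute RMSE, MSE and MAE over paired predictions and true ratings."""
    errors = [p - a for p, a in zip(predicted, actual)]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return {"RMSE": math.sqrt(mse), "MSE": mse, "MAE": mae}

# Toy example (hypothetical ratings, not from the report).
metrics = error_metrics(predicted=[4.0, 3.0, 5.0, 2.0], actual=[5, 3, 4, 4])
print(metrics)

# Consistency check on the reported table: RMSE squared should match MSE.
ubcf_gap = abs(1.096151 ** 2 - 1.201546)
ibcf_gap = abs(1.203950 ** 2 - 1.449496)
```

In the table, UBCF has the lower RMSE while IBCF has the slightly lower MAE. Because RMSE weights large errors more heavily, this suggests that IBCF makes fewer small errors but more large ones on this split.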
Evaluation of a top-N recommender algorithm using a 4-fold cross-validation scheme with the Given-3 protocol, i.e., for the test users all but three randomly selected items are withheld for evaluation
Evaluation scheme with 3 items given
Method: 'cross-validation' with 4 run(s).
Good ratings: >=4.000000
Data set: 500 x 1664 rating matrix of class 'realRatingMatrix' with 56431 ratings.
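A 4-fold cross-validation with the Given-3 protocol can be pictured as in the Python sketch below. The users are partitioned into four folds, each fold serves once as the test set, and for every test user three randomly chosen ratings are shown to the recommender while the rest are withheld. This is an illustrative sketch only. The function names, seeds, and synthetic ratings are hypothetical.

```python
import random

def cross_validation_folds(users, k=4, seed=0):
    """Yield (train_users, test_users) for each of k folds."""
    rng = random.Random(seed)
    shuffled = list(users)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [u for j, fold in enumerate(folds) if j != i for u in fold]
        yield train, test

def given_split(user_ratings, given, rng):
    """Reveal `given` randomly chosen ratings and withhold the rest."""
    items = sorted(user_ratings)
    rng.shuffle(items)
    known = {i: user_ratings[i] for i in items[:given]}
    withheld = {i: user_ratings[i] for i in items[given:]}
    return known, withheld

runs = list(cross_validation_folds(range(500), k=4))
# One synthetic user with 20 ratings (not MovieLens data).
one_user = {f"film{j}": 1 + j % 5 for j in range(20)}
known, withheld = given_split(one_user, given=3, rng=random.Random(1))
print(len(runs), len(known), len(withheld))
```

Every user is tested exactly once across the four runs, so the averaged results use the whole data set.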
We use the created evaluation scheme to evaluate the POPULAR recommender method.
We evaluate top-1, top-3, top-5, top-10, top-15 and top-20 recommendation lists.
POPULAR run
1 [0.03sec/0.28sec]
2 [0.02sec/0.25sec]
3 [0.02sec/0.27sec]
4 [0.02sec/0.25sec]
Evaluation results for 4 runs using method 'POPULAR'.
Confusion matrix for a single run of the above evaluation
      TP     FP     FN       TN precision     recall        TPR          FPR
1  0.536  0.464 56.008 1603.992 0.5360000 0.01321655 0.01321655 0.0002849276
3  1.408  1.592 55.136 1602.864 0.4693333 0.03200618 0.03200618 0.0009801296
5  1.944  3.056 54.600 1601.400 0.3888000 0.04081173 0.04081173 0.0018849863
10 3.520  6.480 53.024 1597.976 0.3520000 0.07073502 0.07073502 0.0039971671
15 4.880 10.120 51.664 1594.336 0.3253333 0.09563071 0.09563071 0.0062472511
20 5.952 14.048 50.592 1590.408 0.2976000 0.12089959 0.12089959 0.0086845888
Confusion matrix averaged over the 4 runs
      TP     FP     FN       TN precision     recall        TPR          FPR
1  0.532  0.468 60.918 1599.082 0.5320000 0.01204009 0.01204009 0.0002876412
3  1.418  1.582 60.032 1597.968 0.4726667 0.02957921 0.02957921 0.0009744666
5  2.074  2.926 59.376 1596.624 0.4148000 0.04052043 0.04052043 0.0018043129
10 3.670  6.330 57.780 1593.220 0.3670000 0.07086318 0.07086318 0.0039090485
15 5.112  9.888 56.338 1589.662 0.3408000 0.10122636 0.10122636 0.0061149169
20 6.322 13.678 55.128 1585.872 0.3161000 0.12356668 0.12356668 0.0084660350
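The columns of these tables follow the usual confusion-matrix definitions for a top-N list. The Python sketch below computes them for a single user. The function name and the toy item sets are hypothetical. The candidate count of 1661 is taken from the report (1664 films minus the 3 given items), and each table row above does sum to 1661.

```python
def top_n_metrics(recommended, liked, n_candidates):
    """Confusion-matrix counts and rates for one user's top-N list.

    recommended: list of distinct recommended items.
    liked: set of withheld items the user rated as good.
    n_candidates: number of items that could have been recommended.
    """
    tp = len(set(recommended) & set(liked))
    fp = len(recommended) - tp
    fn = len(liked) - tp
    tn = n_candidates - tp - fp - fn
    return {
        "TP": tp, "FP": fp, "FN": fn, "TN": tn,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn) if tp + fn else float("nan"),  # recall is the same as TPR
        "FPR": fp / (fp + tn),
    }

# Toy example (hypothetical items) with a top-5 list.
m = top_n_metrics(["a", "b", "c", "d", "e"], {"a", "c", "x", "y"}, n_candidates=1661)
print(m)
```

In the tables, precision equals TP / (TP + FP) of the averaged counts, but recall and FPR are not exactly the ratios of the averaged counts. For example, 0.536 / (0.536 + 56.008) is about 0.0095, not 0.0132. This would be consistent with those columns being averaged per user, although that reading is an inference from the numbers rather than something the output states.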
Plotting the results
Now let's compare the methods
Evaluation scheme with 5 items given
Method: 'split' with 1 run(s).
Training set proportion: 0.750
Good ratings: >=4.000000
Data set: 500 x 1664 rating matrix of class 'realRatingMatrix' with 56431 ratings.
Let's run the algorithms
RANDOM run
1 [0sec/1.39sec]
POPULAR run
1 [0.01sec/0.38sec]
UBCF run
1 [0.02sec/42.62sec]
List of evaluation results for 3 recommenders:
Evaluation results for 1 runs using method 'RANDOM'.
Evaluation results for 1 runs using method 'POPULAR'.
Evaluation results for 1 runs using method 'UBCF'.
[1] "random items" "popular items" "user-based CF"
Evaluation results for 1 runs using method 'UBCF'.
Plotting the results
For this data set and the given evaluation scheme, the user-based and item-based CF methods outperform the other models.
In the TPR/FPR plot we see that they dominate the other methods, since for each length of the top-N list they provide a better combination of TPR and FPR.
Next, we check how the algorithms compare when given less information, using binary (0/1) ratings instead of the real-valued ratings.
809 x 1664 rating matrix of class 'binaryRatingMatrix' with 79787 ratings.
Evaluation scheme with 20 items given
Method: 'split' with 1 run(s).
Training set proportion: 0.750
Good ratings: NA
Data set: 500 x 1664 rating matrix of class 'binaryRatingMatrix' with 51862 ratings.
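Giving the algorithms "less information" here means replacing each 1-5 rating with a binary flag. A minimal Python sketch of such a binarization is shown below. The threshold used in the report is not stated, so `min_rating` is a hypothetical parameter. The user filter is also an assumption, suggested only by the drop in the user count in the output above.

```python
def binarize(ratings, min_rating):
    """Keep only the items each user rated at or above min_rating (as a set)."""
    return {
        user: {item for item, r in items.items() if r >= min_rating}
        for user, items in ratings.items()
    }

def keep_active_users(binary, min_items):
    """Drop users with min_items or fewer positive items (assumed filtering step)."""
    return {u: items for u, items in binary.items() if len(items) > min_items}

# Toy data (hypothetical); min_rating=4 and min_items=1 are illustrative values only.
ratings = {
    "u1": {"A": 5, "B": 4, "C": 2},
    "u2": {"A": 3, "B": 1},
    "u3": {"A": 4, "C": 3},
}
binary = binarize(ratings, min_rating=4)
active = keep_active_users(binary, min_items=1)
print(binary, active)
```

After binarization a rating of 4 and a rating of 5 look identical, so the recommenders can only use which films a user liked, not how much.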
RANDOM run
1 [0sec/1.52sec]
POPULAR run
1 [0sec/1.91sec]
UBCF run
1 [0sec/5.73sec]
Plotting the results
Conclusion
From our initial models, we see that both the IBCF and UBCF models work better than the other models considered. With less information given to the models, however, UBCF performs better than the other models.
We conclude that both the UBCF and IBCF models perform well for this dataset.