PROJECT ON RECOMMENDER SYSTEM

Presented By: Abinaya, Badrinath, Rajkumar

Date: 31st Jan 2016

Name of the Data Set: MovieLens 100k Dataset

INTRODUCTION:

MovieLens data sets were collected by the GroupLens Research Project at the University of Minnesota.   
 
This data set consists of:   

1. 100,000 ratings (1-5) from 943 users on 1682 movies.   
2. Each user has rated at least 20 movies.    
3. Simple demographic info for the users (age, gender, occupation, zip)   

The data was collected through the MovieLens web site (movielens.umn.edu) during the seven-month period from September 19th, 1997 through April 22nd, 1998. This data has been cleaned up: users who had fewer than 20 ratings or did not have complete demographic information were removed from this data set.
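As a quick check on the figures above, the following minimal sketch computes how sparse the rating matrix is. It is plain Python used only for illustration (not the tool that produced the output in this report), and the variable names are my own.

```python
# Figures taken from the data set description above.
n_ratings = 100_000
n_users = 943
n_movies = 1682

# Fraction of all possible (user, movie) pairs that actually carry a rating.
density = n_ratings / (n_users * n_movies)
# Mean number of ratings contributed by one user.
ratings_per_user = n_ratings / n_users

print(f"density: {density:.4f}")
print(f"ratings per user: {ratings_per_user:.1f}")
```

Roughly 6% of the user-movie cells are filled, so any recommender built on this data has to work with a very sparse matrix.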

REFERENCE:

Herlocker, J., Konstan, J., Borchers, A., Riedl, J. An Algorithmic Framework for Performing Collaborative Filtering. Proceedings of the 1999 Conference on Research and Development in Information Retrieval. Aug. 1999.

SUMMARY ABOUT DATASET

No of Users Rated: 943    
No of Films Rated: 1664    
Total Ratings:     99392

Creating Recommender Models: Let's look at the models available for real-valued rating data

$IBCF_realRatingMatrix
Recommender method: IBCF
Description: Recommender based on item-based collaborative filtering (real data).
Parameters:
   k method normalize normalize_sim_matrix alpha na_as_zero minRating
1 30 Cosine    center                FALSE   0.5      FALSE        NA

$PCA_realRatingMatrix
Recommender method: PCA
Description: Recommender based on PCA approximation (real data).
Parameters:
  categories method normalize normalize_sim_matrix alpha na_as_zero minRating
1         20 Cosine    center                FALSE   0.5      FALSE        NA

$POPULAR_realRatingMatrix
Recommender method: POPULAR
Description: Recommender based on item popularity (real data).
Parameters: None

$RANDOM_realRatingMatrix
Recommender method: RANDOM
Description: Produce random recommendations (real ratings).
Parameters: None

$SVD_realRatingMatrix
Recommender method: SVD
Description: Recommender based on SVD approximation (real data).
Parameters:
  categories method normalize normalize_sim_matrix alpha treat_na minRating
1         50 Cosine    center                FALSE   0.5   median        NA

$UBCF_realRatingMatrix
Recommender method: UBCF
Description: Recommender based on user-based collaborative filtering (real data).
Parameters:
  method nn sample normalize minRating
1 cosine 25  FALSE    center        NA

LET'S BUILD OUR RECOMMENDER SYSTEM USING A SAMPLE OF 500 USERS

Method#1: “Popular” Recommender Model for 500 Users

Recommender of type 'POPULAR' for 'realRatingMatrix' 
learned using 500 users.

[1] "topN"        "ratings"     "normalize"   "aggregation"

The POPULAR model has the four components listed above; below we use two of them, topN and ratings

Model#1: topN

Recommendations as 'topNList' with n = 1664 for 1 users.

Top-10 film recommendations for 10 other users, based on our recommender model

Recommendations as 'topNList' with n = 10 for 10 users.

[[1]]
 [1] "Star Wars (1977)"                 "Silence of the Lambs, The (1991)"
 [3] "Raiders of the Lost Ark (1981)"   "Pulp Fiction (1994)"             
 [5] "Shawshank Redemption, The (1994)" "Schindler's List (1993)"         
 [7] "Usual Suspects, The (1995)"       "Empire Strikes Back, The (1980)" 
 [9] "Casablanca (1942)"                "L.A. Confidential (1997)"        

[[2]]
 [1] "Star Wars (1977)"                 "Fargo (1996)"                    
 [3] "Silence of the Lambs, The (1991)" "Godfather, The (1972)"           
 [5] "Raiders of the Lost Ark (1981)"   "Pulp Fiction (1994)"             
 [7] "Shawshank Redemption, The (1994)" "Schindler's List (1993)"         
 [9] "Usual Suspects, The (1995)"       "Empire Strikes Back, The (1980)" 

[[3]]
 [1] "Pulp Fiction (1994)"                   
 [2] "Shawshank Redemption, The (1994)"      
 [3] "Casablanca (1942)"                     
 [4] "L.A. Confidential (1997)"              
 [5] "One Flew Over the Cuckoo's Nest (1975)"
 [6] "Braveheart (1995)"                     
 [7] "Amadeus (1984)"                        
 [8] "Good Will Hunting (1997)"              
 [9] "Blade Runner (1982)"                   
[10] "Contact (1997)"                        

[[4]]
 [1] "Shawshank Redemption, The (1994)" "Titanic (1997)"                  
 [3] "Usual Suspects, The (1995)"       "Empire Strikes Back, The (1980)" 
 [5] "Casablanca (1942)"                "L.A. Confidential (1997)"        
 [7] "Braveheart (1995)"                "Princess Bride, The (1987)"      
 [9] "Amadeus (1984)"                   "Good Will Hunting (1997)"        

[[5]]
 [1] "Fargo (1996)"                          
 [2] "Shawshank Redemption, The (1994)"      
 [3] "Schindler's List (1993)"               
 [4] "Usual Suspects, The (1995)"            
 [5] "Casablanca (1942)"                     
 [6] "L.A. Confidential (1997)"              
 [7] "One Flew Over the Cuckoo's Nest (1975)"
 [8] "Good Will Hunting (1997)"              
 [9] "Monty Python and the Holy Grail (1974)"
[10] "Rear Window (1954)"                    

[[6]]
 [1] "Fargo (1996)"                          
 [2] "Silence of the Lambs, The (1991)"      
 [3] "Godfather, The (1972)"                 
 [4] "Shawshank Redemption, The (1994)"      
 [5] "Schindler's List (1993)"               
 [6] "Titanic (1997)"                        
 [7] "Casablanca (1942)"                     
 [8] "L.A. Confidential (1997)"              
 [9] "One Flew Over the Cuckoo's Nest (1975)"
[10] "Braveheart (1995)"                     

[[7]]
 [1] "Fargo (1996)"                     "Silence of the Lambs, The (1991)"
 [3] "Godfather, The (1972)"            "Raiders of the Lost Ark (1981)"  
 [5] "Pulp Fiction (1994)"              "Shawshank Redemption, The (1994)"
 [7] "Schindler's List (1993)"          "Usual Suspects, The (1995)"      
 [9] "Empire Strikes Back, The (1980)"  "Casablanca (1942)"               

[[8]]
 [1] "Fargo (1996)"                     "Godfather, The (1972)"           
 [3] "Pulp Fiction (1994)"              "Shawshank Redemption, The (1994)"
 [5] "Titanic (1997)"                   "Usual Suspects, The (1995)"      
 [7] "Casablanca (1942)"                "L.A. Confidential (1997)"        
 [9] "Braveheart (1995)"                "Good Will Hunting (1997)"        

[[9]]
 [1] "Fargo (1996)"                     "Silence of the Lambs, The (1991)"
 [3] "Godfather, The (1972)"            "Raiders of the Lost Ark (1981)"  
 [5] "Pulp Fiction (1994)"              "Shawshank Redemption, The (1994)"
 [7] "Schindler's List (1993)"          "Titanic (1997)"                  
 [9] "Usual Suspects, The (1995)"       "Empire Strikes Back, The (1980)" 

[[10]]
 [1] "Star Wars (1977)"                 "Fargo (1996)"                    
 [3] "Silence of the Lambs, The (1991)" "Godfather, The (1972)"           
 [5] "Raiders of the Lost Ark (1981)"   "Pulp Fiction (1994)"             
 [7] "Shawshank Redemption, The (1994)" "Schindler's List (1993)"         
 [9] "Usual Suspects, The (1995)"       "Empire Strikes Back, The (1980)"
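The lists above share many titles but are not identical. A plausible reading is that a popularity-based recommender ranks all films once and then skips, per user, the films that user has already rated. The sketch below illustrates that idea in plain Python; ranking by number of ratings, the function name `popular_top_n`, and the toy ratings are all assumptions for illustration, not the package's actual implementation.

```python
from collections import Counter

def popular_top_n(train, seen, n):
    """Return the n most-rated items that the user has not rated yet.

    train: dict user -> dict item -> rating (the training users)
    seen:  set of items the target user has already rated
    """
    counts = Counter(item for ratings in train.values() for item in ratings)
    # most_common() sorts items by descending number of ratings.
    ranked = [item for item, _ in counts.most_common()]
    return [item for item in ranked if item not in seen][:n]

# Toy training data: the ratings are hypothetical, the titles are only labels.
train = {
    "u1": {"Star Wars (1977)": 5, "Fargo (1996)": 4, "Casablanca (1942)": 5},
    "u2": {"Star Wars (1977)": 4, "Fargo (1996)": 5},
    "u3": {"Star Wars (1977)": 5},
}

# A user who has rated nothing gets the global ranking ...
print(popular_top_n(train, set(), 2))
# ... while a user who already rated Star Wars gets the next items instead.
print(popular_top_n(train, {"Star Wars (1977)"}, 2))
```

This would be consistent with "Star Wars (1977)" leading some of the lists above while being absent from others.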

Top-5 recommendations for the same 10 users, taken from the top-10 lists above

Recommendations as 'topNList' with n = 5 for 10 users.

[[1]]
[1] "Star Wars (1977)"                 "Silence of the Lambs, The (1991)"
[3] "Raiders of the Lost Ark (1981)"   "Pulp Fiction (1994)"             
[5] "Shawshank Redemption, The (1994)"

[[2]]
[1] "Star Wars (1977)"                 "Fargo (1996)"                    
[3] "Silence of the Lambs, The (1991)" "Godfather, The (1972)"           
[5] "Raiders of the Lost Ark (1981)"  

[[3]]
[1] "Pulp Fiction (1994)"                   
[2] "Shawshank Redemption, The (1994)"      
[3] "Casablanca (1942)"                     
[4] "L.A. Confidential (1997)"              
[5] "One Flew Over the Cuckoo's Nest (1975)"

[[4]]
[1] "Shawshank Redemption, The (1994)" "Titanic (1997)"                  
[3] "Usual Suspects, The (1995)"       "Empire Strikes Back, The (1980)" 
[5] "Casablanca (1942)"               

[[5]]
[1] "Fargo (1996)"                     "Shawshank Redemption, The (1994)"
[3] "Schindler's List (1993)"          "Usual Suspects, The (1995)"      
[5] "Casablanca (1942)"               

[[6]]
[1] "Fargo (1996)"                     "Silence of the Lambs, The (1991)"
[3] "Godfather, The (1972)"            "Shawshank Redemption, The (1994)"
[5] "Schindler's List (1993)"         

[[7]]
[1] "Fargo (1996)"                     "Silence of the Lambs, The (1991)"
[3] "Godfather, The (1972)"            "Raiders of the Lost Ark (1981)"  
[5] "Pulp Fiction (1994)"             

[[8]]
[1] "Fargo (1996)"                     "Godfather, The (1972)"           
[3] "Pulp Fiction (1994)"              "Shawshank Redemption, The (1994)"
[5] "Titanic (1997)"                  

[[9]]
[1] "Fargo (1996)"                     "Silence of the Lambs, The (1991)"
[3] "Godfather, The (1972)"            "Raiders of the Lost Ark (1981)"  
[5] "Pulp Fiction (1994)"             

[[10]]
[1] "Star Wars (1977)"                 "Fargo (1996)"                    
[3] "Silence of the Lambs, The (1991)" "Godfather, The (1972)"           
[5] "Raiders of the Lost Ark (1981)"
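Each top-5 list above is simply the first five entries of the corresponding top-10 list. The short Python sketch below shows this for the first user; `best_n` is a hypothetical helper name chosen for illustration.

```python
# Top-10 list for the first user, copied from the output above.
top10_user1 = [
    "Star Wars (1977)", "Silence of the Lambs, The (1991)",
    "Raiders of the Lost Ark (1981)", "Pulp Fiction (1994)",
    "Shawshank Redemption, The (1994)", "Schindler's List (1993)",
    "Usual Suspects, The (1995)", "Empire Strikes Back, The (1980)",
    "Casablanca (1942)", "L.A. Confidential (1997)",
]

def best_n(ranked_items, n):
    """Keep the n best entries of an already ranked recommendation list."""
    return ranked_items[:n]

top5_user1 = best_n(top10_user1, 5)
print(top5_user1)
```

Because the longer list is already ranked, no new prediction is needed to obtain a shorter one.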

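Model#2: Ratings - we predict film ratings from our model for 10 users (rows 501 to 510); the first 10 films are shown. The values appear to be on the normalized (centered) scale rather than the original 1-5 scale, and NA appears to mark a film the user has already rated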

10 x 1664 rating matrix of class 'realRatingMatrix' with 14963 ratings.

    Toy Story (1995) GoldenEye (1995) Four Rooms (1995) Get Shorty (1995)
501        0.2732034       -0.3586763        -0.3781417        0.06586568
502        0.2732034       -0.3586763        -0.3781417        0.06586568
503               NA       -0.3586763        -0.3781417        0.06586568
504        0.2732034       -0.3586763        -0.3781417                NA
505               NA       -0.3586763        -0.3781417        0.06586568
506        0.2732034               NA        -0.3781417        0.06586568
507        0.2732034       -0.3586763        -0.3781417        0.06586568
508               NA       -0.3586763        -0.3781417        0.06586568
509        0.2732034       -0.3586763        -0.3781417        0.06586568
510        0.2732034       -0.3586763        -0.3781417        0.06586568
    Copycat (1995) Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)
501     -0.2864243                                            0.2475417
502     -0.2864243                                            0.2475417
503     -0.2864243                                            0.2475417
504             NA                                            0.2475417
505     -0.2864243                                            0.2475417
506             NA                                            0.2475417
507     -0.2864243                                            0.2475417
508     -0.2864243                                            0.2475417
509     -0.2864243                                            0.2475417
510     -0.2864243                                            0.2475417
    Twelve Monkeys (1995) Babe (1995) Dead Man Walking (1995)
501                    NA   0.4471136                0.394184
502             0.1980479   0.4471136                0.394184
503             0.1980479          NA                0.394184
504             0.1980479   0.4471136                      NA
505                    NA   0.4471136                0.394184
506             0.1980479          NA                0.394184
507             0.1980479   0.4471136                0.394184
508             0.1980479   0.4471136                0.394184
509             0.1980479   0.4471136                0.394184
510             0.1980479   0.4471136                0.394184
    Richard III (1995)
501          0.3051623
502          0.3051623
503                 NA
504          0.3051623
505          0.3051623
506                 NA
507          0.3051623
508          0.3051623
509          0.3051623
510          0.3051623
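The predicted values above are the same for every user and lie close to zero. One way such numbers can arise is sketched below in plain Python: each user's ratings are centered on that user's mean, and the centered ratings are then averaged per film. This is an assumption about the mechanism, shown on hypothetical data; the helper names are my own.

```python
def center_by_user(ratings):
    """Subtract each user's mean rating from that user's ratings."""
    centered = {}
    for user, items in ratings.items():
        mean = sum(items.values()) / len(items)
        centered[user] = {item: r - mean for item, r in items.items()}
    return centered

def item_means(centered):
    """Average the centered ratings of each item over the users who rated it."""
    totals, counts = {}, {}
    for items in centered.values():
        for item, value in items.items():
            totals[item] = totals.get(item, 0.0) + value
            counts[item] = counts.get(item, 0) + 1
    return {item: totals[item] / counts[item] for item in totals}

def predict(scores, already_rated):
    """Same score for every user; None (NA) where the user has a rating."""
    return {item: (None if item in already_rated else s)
            for item, s in scores.items()}

# Hypothetical 1-5 ratings from three training users.
train = {
    "a": {"Babe": 5, "GoldenEye": 3},
    "b": {"Babe": 4, "GoldenEye": 2, "Copycat": 3},
    "c": {"GoldenEye": 4, "Copycat": 2},
}
scores = item_means(center_by_user(train))
print(predict(scores, {"Babe"}))
```

In this sketch a positive score means the film tends to be rated above a user's personal average, and a negative score means below it.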

Let's evaluate our model using the “split” method

Evaluation scheme with 15 items given
Method: 'split' with 1 run(s).
Training set proportion: 0.750
Good ratings: >=4.000000
Data set: 500 x 1664 rating matrix of class 'realRatingMatrix' with 56431 ratings.
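The scheme above can be pictured as follows: 75% of the 500 users (375) go to the training set, and for each of the remaining 125 test users 15 ratings are shown to the recommender while the rest are withheld for scoring. The Python sketch below is a minimal illustration under those assumptions; `split_scheme` and the synthetic data are hypothetical.

```python
import random

def split_scheme(ratings, train_fraction, given, seed=0):
    """Split users into train/test and withhold all but `given` items per test user.

    ratings: dict user -> dict item -> rating
    Returns (train, known, unknown): `known` holds the `given` items shown to
    the recommender, `unknown` the withheld items used to score predictions.
    """
    rng = random.Random(seed)
    users = sorted(ratings)
    rng.shuffle(users)
    n_train = round(train_fraction * len(users))
    train = {u: ratings[u] for u in users[:n_train]}
    known, unknown = {}, {}
    for u in users[n_train:]:
        shown = set(rng.sample(sorted(ratings[u]), given))
        known[u] = {i: r for i, r in ratings[u].items() if i in shown}
        unknown[u] = {i: r for i, r in ratings[u].items() if i not in shown}
    return train, known, unknown

# Synthetic stand-in data: 500 users with 20 hypothetical ratings each.
data = {u: {f"film{i}": 1 + (u + i) % 5 for i in range(20)} for u in range(500)}
train, known, unknown = split_scheme(data, train_fraction=0.75, given=15)
print(len(train), len(known), len(unknown))
```

Keeping the known and unknown parts separate is what lets predictions be compared against ratings the recommender never saw.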

Let's create two recommenders, USER-BASED and ITEM-BASED COLLABORATIVE FILTERING, using the training data above

Method#2

USER-BASED COLLABORATIVE FILTERING (UBCF)

Recommender of type 'UBCF' for 'realRatingMatrix' 
learned using 375 users.

Method#3

ITEM-BASED COLLABORATIVE FILTERING (IBCF)

Recommender of type 'IBCF' for 'realRatingMatrix' 
learned using 375 users.
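The difference between the two methods is what gets compared: user-based CF measures similarity between users (rows of the rating matrix), item-based CF between films (columns). The Python sketch below illustrates this with cosine similarity on hypothetical ratings. It treats missing ratings as zero, which is a simplification and not necessarily how the package handles them.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    common = set(a) & set(b)
    dot = sum(a[k] * b[k] for k in common)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def transpose(ratings):
    """Turn user -> item -> rating into item -> user -> rating."""
    by_item = {}
    for user, items in ratings.items():
        for item, r in items.items():
            by_item.setdefault(item, {})[user] = r
    return by_item

# Hypothetical ratings.
ratings = {
    "u1": {"A": 5, "B": 4},
    "u2": {"A": 5, "B": 4, "C": 1},
    "u3": {"A": 1, "C": 5},
}

# User-based CF compares rows (users) ...
user_sim = cosine(ratings["u1"], ratings["u2"])
# ... item-based CF compares columns (items).
by_item = transpose(ratings)
item_sim = cosine(by_item["A"], by_item["B"])
print(round(user_sim, 3), round(item_sim, 3))
```

A prediction is then formed from the ratings of the most similar users (UBCF) or from the user's own ratings of the most similar films (IBCF).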

Predict ratings from the known part of the test data using the two algorithms above

Model#1 (Ratings): UBCF recommender (R1), given the 15 known items of each test user from the evaluation scheme

125 x 1664 rating matrix of class 'realRatingMatrix' with 206125 ratings.

Model#2 (Ratings): IBCF recommender (R2), given the 15 known items of each test user from the evaluation scheme

125 x 1664 rating matrix of class 'realRatingMatrix' with 35735 ratings.

Prediction errors of the two algorithms

         RMSE      MSE       MAE
UBCF 1.096151 1.201546 0.8800528
IBCF 1.203950 1.449496 0.8558741
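For reference, the three error measures in the table can be computed as in the Python sketch below. The sample predictions are hypothetical; the last lines only check that the reported RMSE values equal the square root of the reported MSE values.

```python
import math

def prediction_errors(predicted, actual):
    """Return (RMSE, MSE, MAE) over paired predicted/actual ratings."""
    diffs = [p - a for p, a in zip(predicted, actual)]
    mse = sum(d * d for d in diffs) / len(diffs)
    mae = sum(abs(d) for d in diffs) / len(diffs)
    return math.sqrt(mse), mse, mae

# Hypothetical predictions against withheld ratings.
rmse, mse, mae = prediction_errors([4.0, 3.0, 5.0, 2.0], [5, 3, 4, 4])
print(rmse, mse, mae)

# Consistency check on the table above: RMSE should equal sqrt(MSE).
table = {"UBCF": (1.096151, 1.201546), "IBCF": (1.203950, 1.449496)}
checks = {
    name: abs(math.sqrt(mse_reported) - rmse_reported) < 1e-5
    for name, (rmse_reported, mse_reported) in table.items()
}
print(checks)
```

Note that the two algorithms rank differently depending on the measure: UBCF has the lower RMSE and MSE, while IBCF has the lower MAE. RMSE penalizes large individual errors more heavily than MAE does.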

Evaluation of a top-N recommender algorithm using a 4-fold cross-validation scheme with the Given-3 protocol, i.e., for the test users all but three randomly selected items are withheld for evaluation

Evaluation scheme with 3 items given
Method: 'cross-validation' with 4 run(s).
Good ratings: >=4.000000
Data set: 500 x 1664 rating matrix of class 'realRatingMatrix' with 56431 ratings.
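With 4-fold cross-validation the 500 users are divided into four groups, and each group serves once as the test set while the other three are used for training. The Python sketch below shows only this fold bookkeeping; it does not shuffle the users, which a real scheme normally would, and `k_fold_users` is a hypothetical helper.

```python
def k_fold_users(users, k):
    """Yield (train_users, test_users) for each of k folds."""
    folds = [users[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [u for j, fold in enumerate(folds) if j != i for u in fold]
        yield train, test

users = list(range(500))
sizes = [(len(train), len(test)) for train, test in k_fold_users(users, 4)]
print(sizes)  # [(375, 125), (375, 125), (375, 125), (375, 125)]
```

Under the Given-3 protocol, only three ratings of each test user would then be shown to the recommender in every fold.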

We use this evaluation scheme to evaluate the POPULAR recommender method.
We evaluate top-1, top-3, top-5, top-10, top-15 and top-20 recommendation lists.

POPULAR run 
     1  [0.03sec/0.28sec] 
     2  [0.02sec/0.25sec] 
     3  [0.02sec/0.27sec] 
     4  [0.02sec/0.25sec]

Evaluation results for 4 runs using method 'POPULAR'.

Confusion matrix for one of the 4 runs

      TP     FP     FN       TN precision     recall        TPR          FPR
1  0.536  0.464 56.008 1603.992 0.5360000 0.01321655 0.01321655 0.0002849276
3  1.408  1.592 55.136 1602.864 0.4693333 0.03200618 0.03200618 0.0009801296
5  1.944  3.056 54.600 1601.400 0.3888000 0.04081173 0.04081173 0.0018849863
10 3.520  6.480 53.024 1597.976 0.3520000 0.07073502 0.07073502 0.0039971671
15 4.880 10.120 51.664 1594.336 0.3253333 0.09563071 0.09563071 0.0062472511
20 5.952 14.048 50.592 1590.408 0.2976000 0.12089959 0.12089959 0.0086845888
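The columns of this table are related in simple ways, which the Python sketch below checks using the rows as printed: TP + FP equals the list length n, precision is TP / (TP + FP), and the four counts add up to the number of films that can still be recommended to a test user.

```python
# Rows of the confusion matrix above: n -> (TP, FP, FN, TN).
rows = {
    1: (0.536, 0.464, 56.008, 1603.992),
    3: (1.408, 1.592, 55.136, 1602.864),
    5: (1.944, 3.056, 54.600, 1601.400),
    10: (3.520, 6.480, 53.024, 1597.976),
    15: (4.880, 10.120, 51.664, 1594.336),
    20: (5.952, 14.048, 50.592, 1590.408),
}

for n, (tp, fp, fn, tn) in rows.items():
    precision = tp / (tp + fp)   # share of recommended films rated "good"
    total = tp + fp + fn + tn    # films available for recommendation
    print(n, round(precision, 7), round(total, 3))
```

Every row sums to 1661, that is, the 1664 films minus the 3 given ones, which matches the Given-3 protocol. The recall (TPR) and FPR columns do not equal TP / (TP + FN) and FP / (FP + TN) computed from these averaged counts, which suggests those rates were averaged per user; that reading is an inference from the numbers, not something the output states.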

Confusion matrix averaged over the 4 runs

      TP     FP     FN       TN precision     recall        TPR          FPR
1  0.532  0.468 60.918 1599.082 0.5320000 0.01204009 0.01204009 0.0002876412
3  1.418  1.582 60.032 1597.968 0.4726667 0.02957921 0.02957921 0.0009744666
5  2.074  2.926 59.376 1596.624 0.4148000 0.04052043 0.04052043 0.0018043129
10 3.670  6.330 57.780 1593.220 0.3670000 0.07086318 0.07086318 0.0039090485
15 5.112  9.888 56.338 1589.662 0.3408000 0.10122636 0.10122636 0.0061149169
20 6.322 13.678 55.128 1585.872 0.3161000 0.12356668 0.12356668 0.0084660350

Plotting the results

Now let's compare the methods

Evaluation scheme with 5 items given
Method: 'split' with 1 run(s).
Training set proportion: 0.750
Good ratings: >=4.000000
Data set: 500 x 1664 rating matrix of class 'realRatingMatrix' with 56431 ratings.

Let's run the algorithms

RANDOM run 
     1  [0sec/1.39sec] 
POPULAR run 
     1  [0.01sec/0.38sec] 
UBCF run 
     1  [0.02sec/42.62sec]

List of evaluation results for 3 recommenders:
Evaluation results for 1 runs using method 'RANDOM'.
Evaluation results for 1 runs using method 'POPULAR'.
Evaluation results for 1 runs using method 'UBCF'.

[1] "random items"  "popular items" "user-based CF"

Evaluation results for 1 runs using method 'UBCF'.

Plotting the results

For this data set and the given evaluation scheme, the user-based and item-based CF methods outperform the other models

In the TPR graph (TPR plotted against FPR) we see that they dominate the other methods, since for each length of top-N list they provide a better combination of TPR and FPR
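To make "dominate" concrete: one method dominates another if, at every list length, its TPR is at least as high and its FPR at least as low, with a strict improvement somewhere. The Python sketch below encodes that check. The (FPR, TPR) points are hypothetical, since the plotted values are not reproduced in this report.

```python
def dominates(curve_a, curve_b):
    """True if, at every list length, A has TPR >= B and FPR <= B,
    with at least one strict improvement somewhere."""
    pairs = list(zip(curve_a, curve_b))
    at_least_as_good = all(
        tpr_a >= tpr_b and fpr_a <= fpr_b
        for (fpr_a, tpr_a), (fpr_b, tpr_b) in pairs
    )
    strictly_better = any(
        tpr_a > tpr_b or fpr_a < fpr_b
        for (fpr_a, tpr_a), (fpr_b, tpr_b) in pairs
    )
    return at_least_as_good and strictly_better

# Hypothetical (FPR, TPR) points for top-1, top-5 and top-10 lists.
cf_method = [(0.0002, 0.015), (0.0017, 0.050), (0.0037, 0.085)]
baseline = [(0.0003, 0.012), (0.0018, 0.040), (0.0039, 0.070)]

print(dominates(cf_method, baseline))  # True
print(dominates(baseline, cf_method))  # False
```

In a plot, a dominating method's curve lies above and to the left of the other curve at every marked list length.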

Next, for comparison, we check how the algorithms perform when given less information, using binary (0-1) rating data

809 x 1664 rating matrix of class 'binaryRatingMatrix' with 79787 ratings.
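Converting ratings to 0-1 data discards the rating value and keeps only whether a user liked a film. The Python sketch below shows one way to do this. The threshold of 4 and the choice to drop users left with no positive ratings are assumptions for illustration; the report does not state which settings were used.

```python
def binarize(ratings, min_rating):
    """Keep only (user, item) pairs rated at least min_rating, as 0-1 data."""
    binary = {}
    for user, items in ratings.items():
        liked = {item for item, r in items.items() if r >= min_rating}
        if liked:  # users with no positive rating are dropped in this sketch
            binary[user] = liked
    return binary

# Hypothetical ratings; the threshold of 4 is an assumption.
ratings = {
    "u1": {"A": 5, "B": 2, "C": 4},
    "u2": {"A": 3, "B": 1},
    "u3": {"B": 4},
}
binary = binarize(ratings, min_rating=4)
print(binary)
```

With binary data there are no rating values left to predict, so only top-N recommendation lists can be evaluated.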

Evaluation scheme with 20 items given
Method: 'split' with 1 run(s).
Training set proportion: 0.750
Good ratings: NA
Data set: 500 x 1664 rating matrix of class 'binaryRatingMatrix' with 51862 ratings.

RANDOM run 
     1  [0sec/1.52sec] 
POPULAR run 
     1  [0sec/1.91sec] 
UBCF run 
     1  [0sec/5.73sec]

Plotting the results

Conclusion

From our initial models, it is understood that both the IBCF and UBCF models work better than the other models; however, in our latest comparison, where the models are given less information, UBCF performs better than the other models

We conclude that both the UBCF and IBCF models perform well for this dataset.