PROJECT ON RECOMMENDER SYSTEM


Presented By: Abinaya, Badrinath, Rajkumar


Date: 19th Mar 2016


Name of the Data Set: MovieLens 100k Dataset


CONTENTS


1. Introduction, Reference and Summary about the dataset
2. Basic Graphical representation of the dataset
3. Creating recommender models
   3.1. Creating a recommender based on popularity of items
   3.2. Creating top 10 recommendations for 10 users
   3.3. Extracting best 3 recommendations for 10 users
   3.4. Predict ratings for next 10 users
4. Evaluation of predicted ratings
   4.1. Create recommenders User Based
   4.2. Create recommenders Item Based
   4.3. Compute predicted ratings for the known part of test data
   4.4. Compute error between prediction & the unknown part of test data
5. Evaluation of a top-N recommender algorithm
   5.1. Evaluation using 4 fold cross validation
   5.2. Evaluate the recommender method Popular
   5.3. Confusion Matrix
   5.4. Averaging the Confusion Matrix
   5.5. ROC Curve Plot (TPR & FPR)
   5.6. Precision-Recall plot
6. Comparing Recommender Algorithms
   6.1. Comparing top-N recommendations
   6.2. Run the algorithms - topNList
   6.3. Plotting the ROC Curve
   6.4. Plotting the Comparison of Precision-recall curve
   6.5. Run the algorithms - ratings
   6.6. Plotting the results - ratings
7. Converting into Binary Matrix & Comparing algorithms
   7.1. Converting matrix to Binary Matrix
   7.2. Evaluating the scheme
   7.3. Plotting the results
8. Conclusion

1. INTRODUCTION & INSIGHTS ABOUT THE DATASET

MovieLens data sets were collected by the GroupLens Research Project at the University of Minnesota.   
 
This data set consists of:   

1. 100,000 ratings (1-5) from 943 users on 1682 movies.   
2. Each user has rated at least 20 movies.    
3. Simple demographic info for the users (age, gender, occupation, zip)   

The data was collected through the MovieLens web site (movielens.umn.edu) during the seven-month period from September 19th, 1997 through April 22nd, 1998. This data has been cleaned up: users who had fewer than 20 ratings or did not have complete demographic information were removed from this data set.

REFERENCE:

Herlocker, J., Konstan, J., Borchers, A., Riedl, J. An Algorithmic Framework for Performing Collaborative Filtering. Proceedings of the 1999 Conference on Research and Development in Information Retrieval. Aug. 1999.

SUMMARY ABOUT DATASET

No. of Users Rated: 943
No. of Films Rated: 1664
Total Ratings:      99392

2. BASIC GRAPHICAL REPRESENTATIONS OF THE DATASET








3. CREATING RECOMMENDER MODELS:


$IBCF_realRatingMatrix
Recommender method: IBCF
Description: Recommender based on item-based collaborative filtering (real data).
Parameters:
   k method normalize normalize_sim_matrix alpha na_as_zero minRating
1 30 Cosine    center                FALSE   0.5      FALSE        NA

$PCA_realRatingMatrix
Recommender method: PCA
Description: Recommender based on PCA approximation (real data).
Parameters:
  categories method normalize normalize_sim_matrix alpha na_as_zero
1         20 Cosine    center                FALSE   0.5      FALSE
  minRating
1        NA

$POPULAR_realRatingMatrix
Recommender method: POPULAR
Description: Recommender based on item popularity (real data).
Parameters: None

$RANDOM_realRatingMatrix
Recommender method: RANDOM
Description: Produce random recommendations (real ratings).
Parameters: None

$SVD_realRatingMatrix
Recommender method: SVD
Description: Recommender based on SVD approximation (real data).
Parameters:
  categories method normalize normalize_sim_matrix alpha treat_na
1         50 Cosine    center                FALSE   0.5   median
  minRating
1        NA

$UBCF_realRatingMatrix
Recommender method: UBCF
Description: Recommender based on user-based collaborative filtering (real data).
Parameters:
  method nn sample normalize minRating
1 cosine 25  FALSE    center        NA

LET'S CREATE A RECOMMENDER USING A SAMPLE OF 200 USERS

3.1. Method#1: “Popular Method” - popularity of items

Recommender of type 'POPULAR' for 'realRatingMatrix' 
learned using 200 users.
[1] "topN"        "ratings"     "normalize"   "aggregation"

Model#1:topN

Recommendations as 'topNList' with n = 1664 for 1 users. 

In the above case, the model has a top-N list to store the popularity order.
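The idea of a stored popularity order can be illustrated with a minimal sketch. This is plain Python, not the recommenderlab code behind the output above, and it assumes that "popularity" means the number of ratings an item received; the function names and the toy data are hypothetical. Skipping items the user has already rated would also explain why the top-10 lists in 3.2 differ from user to user.

```python
from collections import defaultdict

def popularity_order(ratings):
    """Rank items by how many ratings they received (ties broken by name).

    `ratings` maps user -> {item: rating}. Counting ratings is an assumed
    popularity measure for this sketch; other definitions are possible.
    """
    counts = defaultdict(int)
    for user_items in ratings.values():
        for item in user_items:
            counts[item] += 1
    return sorted(counts, key=lambda item: (-counts[item], item))

def recommend_top_n(order, already_rated, n):
    """Return the n most popular items that the user has not rated yet."""
    return [item for item in order if item not in already_rated][:n]

# Toy training data (hypothetical ratings, real titles used only as labels).
train = {
    "u1": {"Star Wars": 5, "Fargo": 4, "Alien": 3},
    "u2": {"Star Wars": 4, "Fargo": 5},
    "u3": {"Star Wars": 5},
}
order = popularity_order(train)
top2 = recommend_top_n(order, already_rated={"Star Wars"}, n=2)
print(order)  # popularity order learned from the training users
print(top2)   # recommendations for a user who has already seen Star Wars
```

Under this reading, a shorter list (for example the best 3) is simply the first entries of a longer one.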


3.2. Creating top-10 recommendation lists for the next 10 users, who were not used to learn the model

Recommendations as 'topNList' with n = 10 for 10 users. 
[[1]]
 [1] "2001: A Space Odyssey (1968)"                                               
 [2] "Monty Python and the Holy Grail (1974)"                                     
 [3] "Postino, Il (1994)"                                                         
 [4] "Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)"
 [5] "Sting, The (1973)"                                                          
 [6] "As Good As It Gets (1997)"                                                  
 [7] "Young Frankenstein (1974)"                                                  
 [8] "To Kill a Mockingbird (1962)"                                               
 [9] "Cool Hand Luke (1967)"                                                      
[10] "Bridge on the River Kwai, The (1957)"                                       

[[2]]
 [1] "Star Wars (1977)"                 "Fargo (1996)"                    
 [3] "Raiders of the Lost Ark (1981)"   "Shawshank Redemption, The (1994)"
 [5] "Return of the Jedi (1983)"        "Godfather, The (1972)"           
 [7] "Titanic (1997)"                   "Silence of the Lambs, The (1991)"
 [9] "Pulp Fiction (1994)"              "Alien (1979)"                    

[[3]]
 [1] "Raiders of the Lost Ark (1981)"   "Shawshank Redemption, The (1994)"
 [3] "Godfather, The (1972)"            "Titanic (1997)"                  
 [5] "Silence of the Lambs, The (1991)" "Schindler's List (1993)"         
 [7] "Pulp Fiction (1994)"              "Empire Strikes Back, The (1980)" 
 [9] "Princess Bride, The (1987)"       "Alien (1979)"                    

[[4]]
 [1] "Star Wars (1977)"                 "Fargo (1996)"                    
 [3] "Raiders of the Lost Ark (1981)"   "Shawshank Redemption, The (1994)"
 [5] "Return of the Jedi (1983)"        "Godfather, The (1972)"           
 [7] "Titanic (1997)"                   "Silence of the Lambs, The (1991)"
 [9] "Pulp Fiction (1994)"              "Princess Bride, The (1987)"      

[[5]]
 [1] "Star Wars (1977)"                 "Fargo (1996)"                    
 [3] "Raiders of the Lost Ark (1981)"   "Shawshank Redemption, The (1994)"
 [5] "Return of the Jedi (1983)"        "Godfather, The (1972)"           
 [7] "Silence of the Lambs, The (1991)" "Schindler's List (1993)"         
 [9] "Pulp Fiction (1994)"              "Empire Strikes Back, The (1980)" 

[[6]]
 [1] "Star Wars (1977)"                 "Fargo (1996)"                    
 [3] "Raiders of the Lost Ark (1981)"   "Shawshank Redemption, The (1994)"
 [5] "Return of the Jedi (1983)"        "Godfather, The (1972)"           
 [7] "Silence of the Lambs, The (1991)" "Schindler's List (1993)"         
 [9] "Pulp Fiction (1994)"              "Empire Strikes Back, The (1980)" 

[[7]]
 [1] "Star Wars (1977)"                                                           
 [2] "Empire Strikes Back, The (1980)"                                            
 [3] "Blade Runner (1982)"                                                        
 [4] "Toy Story (1995)"                                                           
 [5] "Good Will Hunting (1997)"                                                   
 [6] "Twelve Monkeys (1995)"                                                      
 [7] "Lawrence of Arabia (1962)"                                                  
 [8] "Monty Python and the Holy Grail (1974)"                                     
 [9] "Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)"
[10] "Glory (1989)"                                                               

[[8]]
 [1] "Star Wars (1977)"                 "Fargo (1996)"                    
 [3] "Raiders of the Lost Ark (1981)"   "Shawshank Redemption, The (1994)"
 [5] "Return of the Jedi (1983)"        "Godfather, The (1972)"           
 [7] "Titanic (1997)"                   "Silence of the Lambs, The (1991)"
 [9] "Contact (1997)"                   "Schindler's List (1993)"         

[[9]]
 [1] "Fargo (1996)"                     "Raiders of the Lost Ark (1981)"  
 [3] "Shawshank Redemption, The (1994)" "Titanic (1997)"                  
 [5] "Silence of the Lambs, The (1991)" "Schindler's List (1993)"         
 [7] "Pulp Fiction (1994)"              "Empire Strikes Back, The (1980)" 
 [9] "Princess Bride, The (1987)"       "Alien (1979)"                    

[[10]]
 [1] "Fargo (1996)"                     "Shawshank Redemption, The (1994)"
 [3] "Titanic (1997)"                   "Contact (1997)"                  
 [5] "Schindler's List (1993)"          "Alien (1979)"                    
 [7] "Blade Runner (1982)"              "Usual Suspects, The (1995)"      
 [9] "Braveheart (1995)"                "Good Will Hunting (1997)"        

3.3. Extracting the best 3 recommendations as ‘topNList’ for the same 10 users from the above top-10 lists

Recommendations as 'topNList' with n = 3 for 10 users. 
[[1]]
[1] "2001: A Space Odyssey (1968)"          
[2] "Monty Python and the Holy Grail (1974)"
[3] "Postino, Il (1994)"                    

[[2]]
[1] "Star Wars (1977)"               "Fargo (1996)"                  
[3] "Raiders of the Lost Ark (1981)"

[[3]]
[1] "Raiders of the Lost Ark (1981)"   "Shawshank Redemption, The (1994)"
[3] "Godfather, The (1972)"           

[[4]]
[1] "Star Wars (1977)"               "Fargo (1996)"                  
[3] "Raiders of the Lost Ark (1981)"

[[5]]
[1] "Star Wars (1977)"               "Fargo (1996)"                  
[3] "Raiders of the Lost Ark (1981)"

[[6]]
[1] "Star Wars (1977)"               "Fargo (1996)"                  
[3] "Raiders of the Lost Ark (1981)"

[[7]]
[1] "Star Wars (1977)"                "Empire Strikes Back, The (1980)"
[3] "Blade Runner (1982)"            

[[8]]
[1] "Star Wars (1977)"               "Fargo (1996)"                  
[3] "Raiders of the Lost Ark (1981)"

[[9]]
[1] "Fargo (1996)"                     "Raiders of the Lost Ark (1981)"  
[3] "Shawshank Redemption, The (1994)"

[[10]]
[1] "Fargo (1996)"                     "Shawshank Redemption, The (1994)"
[3] "Titanic (1997)"                  

3.4. Model#2: Ratings - we will predict ratings of films from our model for the next 10 users

10 x 1664 rating matrix of class 'realRatingMatrix' with 13117 ratings.
    Toy Story (1995) GoldenEye (1995) Four Rooms (1995) Get Shorty (1995)
201               NA               NA        -0.3653484                NA
202               NA       -0.3134797        -0.3653484         0.1297056
203               NA       -0.3134797        -0.3653484         0.1297056
204               NA       -0.3134797        -0.3653484         0.1297056
205        0.3485342       -0.3134797        -0.3653484         0.1297056
206        0.3485342       -0.3134797        -0.3653484         0.1297056
207        0.3485342               NA                NA                NA
208        0.3485342       -0.3134797        -0.3653484         0.1297056
209               NA       -0.3134797        -0.3653484         0.1297056
210               NA       -0.3134797        -0.3653484                NA
    Copycat (1995) Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)
201     -0.3799844                                            0.1791742
202     -0.3799844                                            0.1791742
203     -0.3799844                                            0.1791742
204     -0.3799844                                            0.1791742
205     -0.3799844                                            0.1791742
206     -0.3799844                                            0.1791742
207             NA                                            0.1791742
208     -0.3799844                                            0.1791742
209     -0.3799844                                            0.1791742
210     -0.3799844                                            0.1791742
    Twelve Monkeys (1995) Babe (1995) Dead Man Walking (1995)
201                    NA          NA                      NA
202             0.2912737    0.593237               0.4334513
203                    NA    0.593237               0.4334513
204             0.2912737    0.593237                      NA
205             0.2912737    0.593237               0.4334513
206             0.2912737    0.593237               0.4334513
207             0.2912737          NA                      NA
208             0.2912737    0.593237               0.4334513
209             0.2912737    0.593237                      NA
210             0.2912737    0.593237               0.4334513
    Richard III (1995)
201                 NA
202          0.3842573
203          0.3842573
204          0.3842573
205          0.3842573
206          0.3842573
207          0.3842573
208          0.3842573
209          0.3842573
210          0.3842573

The prediction contains NA for the items already rated by the active users. In the example we show the predicted ratings of the first 10 items for 10 users.
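In the matrix above, every non-NA value in a column is identical, which is what a per-item score looks like. The following sketch reproduces that pattern under a simplified, assumed rule (predict each item's mean training rating); it is not the exact formula recommenderlab uses, and `predict_popular_ratings` is a hypothetical name.

```python
def predict_popular_ratings(train, active_user_ratings):
    """Predict a rating for every item the active user has not rated.

    Prediction = mean training rating of the item (an assumed, simplified
    rule). Items the user already rated get None, mirroring the NA entries.
    """
    totals, counts = {}, {}
    for user_items in train.values():
        for item, rating in user_items.items():
            totals[item] = totals.get(item, 0.0) + rating
            counts[item] = counts.get(item, 0) + 1
    predictions = {}
    for item in totals:
        if item in active_user_ratings:
            predictions[item] = None  # already rated, nothing to predict
        else:
            predictions[item] = totals[item] / counts[item]
    return predictions

# Toy data: item "A" has ratings 4 and 2, item "B" has a single rating 2.
train = {"u1": {"A": 4, "B": 2}, "u2": {"A": 2}}
pred = predict_popular_ratings(train, active_user_ratings={"B": 5})
print(pred)
```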


4. Evaluation of predicted ratings


Creating an evaluation scheme which splits the first 200 users into train (90%) and test (10%) sets. For the test set, 15 items per user are given to the algorithm and the other items are held out for computing the error.


Evaluation scheme with 15 items given
Method: 'split' with 1 run(s).
Training set proportion: 0.900
Good ratings: >=5.000000
Data set: 200 x 1664 rating matrix of class 'realRatingMatrix' with 19611 ratings.
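The split described by this scheme can be sketched as follows. This is an illustrative Python version, not the recommenderlab implementation; the function name `split_given`, the fixed seed and the toy ratings are assumptions made for the example.

```python
import random

def split_given(ratings, train_fraction=0.9, given=15, seed=0):
    """Split users into train/test, then split each test user's ratings.

    For every test user, `given` randomly chosen items form the known part
    (shown to the algorithm); the remaining items form the unknown part
    (held out for computing the error).
    """
    rng = random.Random(seed)
    users = sorted(ratings)
    rng.shuffle(users)
    n_train = round(train_fraction * len(users))
    train = {u: ratings[u] for u in users[:n_train]}
    known, unknown = {}, {}
    for u in users[n_train:]:
        items = sorted(ratings[u])
        rng.shuffle(items)
        known[u] = {i: ratings[u][i] for i in items[:given]}
        unknown[u] = {i: ratings[u][i] for i in items[given:]}
    return train, known, unknown

# Toy data: 200 users with 20 hypothetical ratings each.
ratings = {
    f"u{u}": {f"m{i}": 1 + (u + i) % 5 for i in range(20)}
    for u in range(200)
}
train, known, unknown = split_given(ratings)
print(len(train), len(known))  # 180 training users, 20 test users
```

With 200 users and a 0.9 training proportion, 180 users train the model, which matches the "learned using 180 users" lines reported for UBCF and IBCF.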

Create TWO Recommenders: USER BASED & ITEM BASED COLLABORATIVE FILTERING by using the above training data


Method# 2

4.1 USER BASED COLLABORATIVE FILTERING (UBCF)

Recommender of type 'UBCF' for 'realRatingMatrix' 
learned using 180 users.
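As a rough illustration of what user-based collaborative filtering does, the sketch below centers each user's ratings, measures cosine similarity between users, and predicts from the nearest neighbours (the registry listing in section 3 shows cosine similarity, centering and nn = 25 as the UBCF defaults). It is a simplification, not recommenderlab's algorithm: it assumes missing ratings count as zero after centering, and it only considers neighbours who actually rated the target item. All names are hypothetical.

```python
import math

def center(user_ratings):
    """Subtract the user's mean rating; return (centered ratings, mean)."""
    mean = sum(user_ratings.values()) / len(user_ratings)
    return {item: r - mean for item, r in user_ratings.items()}, mean

def cosine(a, b):
    """Cosine similarity of two sparse vectors (missing entries count as 0)."""
    dot = sum(a[i] * b[i] for i in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def predict_ubcf(train, active, item, nn=25):
    """Predict `item` for the active user from the nn most similar raters."""
    active_c, active_mean = center(active)
    scored = []
    for other in train.values():
        if item not in other:
            continue  # simplification: only users who rated the item
        other_c, _ = center(other)
        scored.append((cosine(active_c, other_c), other_c[item]))
    neighbours = sorted(scored, key=lambda s: s[0], reverse=True)[:nn]
    if not neighbours:
        return None
    offset = sum(dev for _, dev in neighbours) / len(neighbours)
    return active_mean + offset  # undo the centering for the active user

# Toy data: user "a" agrees with the active user, user "b" disagrees.
train = {"a": {"X": 5, "Y": 1, "Z": 5}, "b": {"X": 1, "Y": 5, "Z": 1}}
active = {"X": 4, "Y": 2}
pred_z = predict_ubcf(train, active, "Z", nn=1)
print(pred_z)
```

Item-based collaborative filtering follows the same pattern with the roles swapped: similarities are computed between items rather than between users.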

Method# 3

4.2 ITEM BASED COLLABORATIVE FILTERING (IBCF)

Recommender of type 'IBCF' for 'realRatingMatrix' 
learned using 180 users.

4.3. Predict ratings for the known part of test data (15 items for each user) using the above two algorithms


Model#1 (Ratings): Predict ratings for the known part of test data (given 15 items for each user) with “UBCF”

20 x 1664 rating matrix of class 'realRatingMatrix' with 32980 ratings.

Model#2 (Ratings): Predict ratings for the known part of test data (given 15 items for each user) with “IBCF”

20 x 1664 rating matrix of class 'realRatingMatrix' with 5218 ratings.

4.4 Calculate errors between the prediction & the unknown part of test data

         RMSE      MSE       MAE
UBCF 1.226044 1.503185 0.9859527
IBCF 1.342349 1.801900 0.9807998
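The three error measures in the table are related in a simple way: MSE is the mean squared difference between predicted and held-out ratings, RMSE is its square root, and MAE is the mean absolute difference. A minimal Python sketch, with a hypothetical function name and toy numbers, shows the computation; predictions that are missing (None) are skipped, which is an assumption of this sketch.

```python
import math

def prediction_errors(predicted, actual):
    """Return RMSE, MSE and MAE over items that have both values."""
    diffs = [
        predicted[item] - rating
        for item, rating in actual.items()
        if predicted.get(item) is not None
    ]
    mse = sum(d * d for d in diffs) / len(diffs)
    mae = sum(abs(d) for d in diffs) / len(diffs)
    return {"RMSE": math.sqrt(mse), "MSE": mse, "MAE": mae}

# Toy example: item "C" has no prediction and is therefore ignored.
predicted = {"A": 4.0, "B": 2.0, "C": None}
actual = {"A": 5, "B": 2, "C": 3}
errors = prediction_errors(predicted, actual)
print(errors)
```

As a consistency check on the table above, squaring the reported RMSE values gives the reported MSE values up to rounding (1.226044 squared is about 1.503184, and 1.342349 squared is about 1.801901).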

5. Evaluation of a top-N recommender algorithm


5.1 Evaluation using 4 fold cross validation


4-fold cross-validation scheme with the Given-3 protocol, i.e., for the test users all but three randomly selected items are withheld for evaluation.

Evaluation scheme with 3 items given
Method: 'cross-validation' with 4 run(s).
Good ratings: >=5.000000
Data set: 200 x 1664 rating matrix of class 'realRatingMatrix' with 19611 ratings.
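A 4-fold scheme with the Given-3 protocol can be sketched in a few lines of Python. This is only an illustration of the idea, not recommenderlab's implementation; the function names, the seed and the toy data are hypothetical.

```python
import random

def cross_validation_folds(users, k=4, seed=0):
    """Shuffle users, deal them into k folds, yield (train, test) pairs."""
    shuffled = list(users)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for test in folds:
        held = set(test)
        train = [u for u in shuffled if u not in held]
        yield train, test

def given_n(user_ratings, given=3, seed=0):
    """Keep `given` random items as known; withhold the rest for evaluation."""
    items = sorted(user_ratings)
    random.Random(seed).shuffle(items)
    known = {i: user_ratings[i] for i in items[:given]}
    withheld = {i: user_ratings[i] for i in items[given:]}
    return known, withheld

users = [f"u{n}" for n in range(200)]
runs = list(cross_validation_folds(users, k=4))
known, withheld = given_n({"A": 5, "B": 3, "C": 4, "D": 2, "E": 1}, given=3)
print(len(runs), len(runs[0][0]), len(runs[0][1]))  # 4 runs, 150 train, 50 test
```

Each user appears in the test set of exactly one run, so every run evaluates on a different quarter of the 200 users.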

5.2 Evaluate the recommender method Popular

We then use the evaluation scheme created above to evaluate the POPULAR recommender method on top-1, top-3, top-5, top-10, top-15 and top-20 recommendation lists.

POPULAR run 
     1  [0.02sec/0.08sec] 
     2  [0.03sec/0.11sec] 
     3  [0.03sec/0.13sec] 
     4  [0.02sec/0.1sec] 
Evaluation results for 4 runs using method 'POPULAR'.

5.3 Confusion Matrix for the above evaluation

     TP    FP    FN      TN precision     recall        TPR          FPR
1  0.40  0.60 24.44 1635.56     0.400 0.01710254 0.01710254 0.0003644747
3  1.02  1.98 23.82 1634.18     0.340 0.04558676 0.04558676 0.0012022984
5  1.38  3.62 23.46 1632.54     0.276 0.06023690 0.06023690 0.0022017276
10 2.42  7.58 22.42 1628.58     0.242 0.13740461 0.13740461 0.0046162032
15 3.06 11.94 21.78 1624.22     0.204 0.15544386 0.15544386 0.0072731142
20 3.62 16.38 21.22 1619.78     0.181 0.17886515 0.17886515 0.0099791836
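The columns of this table follow the usual confusion-matrix definitions for a top-N list: a recommended item that the user rated as good is a true positive, a recommended item that is not good is a false positive, and so on. The sketch below computes the entries for a single user in Python; the function name and toy data are hypothetical, and it is not the recommenderlab code.

```python
def top_n_confusion(recommended, relevant, n_candidates):
    """Confusion-matrix entries and rates for one user's top-N list.

    recommended  -- list of recommended items (length N)
    relevant     -- set of withheld items the user rated as good
    n_candidates -- number of items that could have been recommended
    """
    tp = len(set(recommended) & set(relevant))
    fp = len(recommended) - tp
    fn = len(relevant) - tp
    tn = n_candidates - tp - fp - fn
    return {
        "TP": tp, "FP": fp, "FN": fn, "TN": tn,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,  # same as TPR
        "FPR": fp / (fp + tn) if fp + tn else 0.0,
    }

# Toy example: 3 recommendations, 2 good items, 10 candidate items.
cm = top_n_confusion(["a", "b", "c"], {"a", "d"}, n_candidates=10)
print(cm)
```

In the table, TP + FP equals the list length N in every row, so precision is simply TP / N (for example 1.38 / 5 = 0.276). Recall does not equal TP / (TP + FN) computed from the averaged counts, presumably because the rates are computed per user before averaging.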

5.4 Averaging the Confusion Matrix over the 4 runs

      TP     FP     FN       TN precision     recall        TPR          FPR
1  0.335  0.665 21.970 1638.030 0.3350000 0.02006051 0.02006051 0.0004041447
3  0.865  2.135 21.440 1636.560 0.2883333 0.04663320 0.04663320 0.0012975391
5  1.225  3.775 21.080 1634.920 0.2450000 0.06464223 0.06464223 0.0022960007
10 2.235  7.765 20.070 1630.930 0.2235000 0.13622210 0.13622210 0.0047251530
15 2.935 12.065 19.370 1626.630 0.1956667 0.17132065 0.17132065 0.0073435834
20 3.520 16.480 18.785 1622.215 0.1760000 0.20042357 0.20042357 0.0100325927
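Averaging here means taking, for each list length N and each column, the mean of the values from the individual cross-validation runs. A minimal Python sketch of that element-wise mean follows; the per-run numbers are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def average_confusion(runs):
    """Element-wise mean of per-run results.

    Each run maps list length N -> {metric name: value}.
    """
    averaged = {}
    for n in runs[0]:
        averaged[n] = {
            metric: sum(run[n][metric] for run in runs) / len(runs)
            for metric in runs[0][n]
        }
    return averaged

# Hypothetical values for two runs and two list lengths.
runs = [
    {1: {"TP": 0.4, "FP": 0.6}, 3: {"TP": 1.0, "FP": 2.0}},
    {1: {"TP": 0.2, "FP": 0.8}, 3: {"TP": 0.8, "FP": 2.2}},
]
avg = average_confusion(runs)
print(avg)
```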

5.5 Plotting the results: ROC Curve - plotting True Positive Rate (TPR) against False Positive Rate (FPR)


5.6 Plotting the results: Precision - Recall plot


6. Comparing Recommender Algorithms


6.1. Comparing top-N recommendations

Evaluation scheme with 5 items given
Method: 'split' with 4 run(s).
Training set proportion: 0.900
Good ratings: >=5.000000
Data set: 200 x 1664 rating matrix of class 'realRatingMatrix' with 19611 ratings.

6.2. Let's run the algorithms - topNList

RANDOM run 
     1  [0sec/0.13sec] 
     2  [0.02sec/0.23sec] 
     3  [0.02sec/0.22sec] 
     4  [0sec/0.2sec] 
POPULAR run 
     1  [0.01sec/0.03sec] 
     2  [0.02sec/0.04sec] 
     3  [0.02sec/0.04sec] 
     4  [0.03sec/0.03sec] 
UBCF run 
     1  [0sec/3.32sec] 
     2  [0sec/2.89sec] 
     3  [0.01sec/3.64sec] 
     4  [0sec/2.08sec] 
List of evaluation results for 3 recommenders:
Evaluation results for 4 runs using method 'RANDOM'.
Evaluation results for 4 runs using method 'POPULAR'.
Evaluation results for 4 runs using method 'UBCF'.
[1] "random items"  "popular items" "user-based CF"
Evaluation results for 4 runs using method 'UBCF'.

6.3. Plotting the ROC curves of the three recommenders under the above evaluation scheme (5 items given)


6.4. Plotting the comparison of precision-recall curves under the above evaluation scheme (5 items given)


6.5. Run the algorithms - ratings

RANDOM run 
     1  [0.02sec/0.01sec] 
     2  [0sec/0.03sec] 
     3  [0.02sec/0.03sec] 
     4  [0.02sec/0.02sec] 
POPULAR run 
     1  [0.01sec/0.03sec] 
     2  [0.02sec/0.04sec] 
     3  [0.03sec/0.03sec] 
     4  [0.02sec/0.04sec] 
UBCF run 
     1  [0.02sec/3.03sec] 
     2  [0.02sec/3.82sec] 
     3  [0.01sec/2.75sec] 
     4  [0sec/2.95sec] 
List of evaluation results for 3 recommenders:
Evaluation results for 4 runs using method 'RANDOM'.
Evaluation results for 4 runs using method 'POPULAR'.
Evaluation results for 4 runs using method 'UBCF'.
[1] "random items"  "popular items" "user-based CF"

6.6. Plotting the results - ratings

******

For this data set and the given evaluation scheme, the user-based and item-based CF methods outperformed the other models.


7. Converting into Binary Matrix & Comparing algorithms

For comparison, we will check how the algorithms perform when given less information, i.e. binary data instead of real-valued ratings.

7.1. Converting matrix to Binary Matrix

898 x 1664 rating matrix of class 'binaryRatingMatrix' with 98506 ratings.
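Binarization replaces each rating with a yes/no flag. The sketch below shows one way to do it in Python: a rating at or above a threshold becomes a positive entry, and users left without any positive entry are dropped. The threshold and the user filter are assumptions of this sketch, since the report does not state which settings produced the 898-user matrix above.

```python
def binarize(ratings, min_rating):
    """Turn real-valued ratings into sets of positively rated items.

    ratings    -- user -> {item: rating}
    min_rating -- ratings >= this value count as 1, everything else as 0
    Users with no positive item are dropped (an assumed filter).
    """
    binary = {}
    for user, items in ratings.items():
        liked = {item for item, rating in items.items() if rating >= min_rating}
        if liked:
            binary[user] = liked
    return binary

# Toy example with a hypothetical threshold of 3.
ratings = {"u1": {"A": 5, "B": 2}, "u2": {"A": 1}}
binary = binarize(ratings, min_rating=3)
print(binary)
```

A binary matrix no longer records how much a user liked an item, which is why the evaluation scheme below reports "Good ratings: NA".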

7.2. Evaluating the scheme

Evaluation scheme with 20 items given
Method: 'split' with 1 run(s).
Training set proportion: 0.900
Good ratings: NA
Data set: 500 x 1664 rating matrix of class 'binaryRatingMatrix' with 58119 ratings.

RANDOM run 
     1  [0.02sec/0.47sec] 
POPULAR run 
     1  [0.01sec/0.72sec] 
UBCF run 
     1  [0.02sec/2.5sec] 

7.3. Plotting the results


8. Conclusion

With the initial "real rating matrix", the POPULAR method performed well on both the ROC curve and the precision-recall plot for top-N recommendations, while UBCF performed better than the other models when comparing predicted ratings.

Once the dataset was converted into a "binary matrix" and all models were compared again, the UBCF model performed well.