Project 3

Your task is to implement a matrix factorization method, such as singular value decomposition (SVD) or Alternating Least Squares (ALS), in the context of a recommender system. You may approach this assignment in a number of ways. You are welcome to start with an existing recommender system written by yourself or someone else. Remember, as always, to cite your sources, so that you can be graded on what you added, not what you found. SVD can be thought of as a pre-processing step for feature engineering. You might easily start with thousands or millions of items, and use SVD to create a much smaller set of “k” items (e.g. 20 or 70).
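
To make the idea of reducing many items to “k” features concrete, here is a minimal sketch using base R's svd() on a small made-up ratings matrix. The matrix values and the choice k = 2 are assumptions for illustration only; this is not the recommenderlab model used later in this report.

```r
# Toy 4 x 5 ratings matrix (rows = users, columns = items); values are made up.
R <- matrix(c(5, 4, 1, 1, 2,
              4, 5, 1, 2, 1,
              1, 1, 5, 4, 4,
              2, 1, 4, 5, 5),
            nrow = 4, byrow = TRUE)

k <- 2                      # number of latent factors to keep (assumed)
s <- svd(R)                 # decomposition: R = U diag(d) V'

# Rank-k approximation built from the k largest singular values
R_k <- s$u[, 1:k] %*% diag(s$d[1:k], nrow = k) %*% t(s$v[, 1:k])

# Reduced user representation: one row per user, k columns instead of 5
user_factors <- s$u[, 1:k] %*% diag(s$d[1:k], nrow = k)

dim(user_factors)           # 4 users x 2 factors
max(abs(R - R_k))           # largest reconstruction error of the approximation
```

Each user is now described by 2 numbers instead of 5 ratings, and the approximation error comes only from the discarded singular values.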

I chose the ml-latest-small dataset from MovieLens. This dataset describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 100836 ratings and 3683 tag applications across 9742 movies. These data were created by 610 users between March 29, 1996 and September 24, 2018. The dataset was generated on September 26, 2018.

Ratings Data File Structure (ratings.csv)

All ratings are contained in the file ratings.csv. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:

userId,movieId,rating,timestamp
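
The timestamp column holds seconds since midnight UTC on January 1, 1970, as described in the MovieLens documentation. A small sketch of converting one such value to a readable date is shown here; the variable names are only illustrative.

```r
# Example timestamp value taken from ratings.csv
ts <- 964982703

# Interpret the value as seconds since the Unix epoch, in UTC
rated_at <- as.POSIXct(ts, origin = "1970-01-01", tz = "UTC")
format(rated_at, "%Y-%m-%d %H:%M:%S")
```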

Movies Data File Structure (movies.csv)

Movie information is contained in the file movies.csv. Each line of this file after the header row represents one movie, and has the following format:

movieId,title,genres
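
The genres field packs several genres into one pipe-separated string. A minimal sketch of splitting it, using two example strings in the same format, could look like this:

```r
genres <- c("Adventure|Animation|Children|Comedy|Fantasy", "Comedy|Romance")

# fixed = TRUE treats "|" literally rather than as regex alternation
genre_list <- strsplit(genres, "|", fixed = TRUE)

lengths(genre_list)   # number of genres per movie
genre_list[[2]]
```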

Download Data

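# Download and extract the archive with base R (unzip -f would only refresh files that already exist)
download.file("http://files.grouplens.org/datasets/movielens/ml-latest-small.zip", destfile = "ml-latest-small.zip", mode = "wb")
unzip("ml-latest-small.zip")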
movies <- read.csv('ml-latest-small/movies.csv')
ratings_data <- read.csv('ml-latest-small/ratings.csv')
#library(dplyr)
library(tidyr)
library(recommenderlab)


Data Exploration

summary(ratings_data)

##      userId         movieId           rating        timestamp        
##  Min.   :  1.0   Min.   :     1   Min.   :0.500   Min.   :8.281e+08  
##  1st Qu.:177.0   1st Qu.:  1199   1st Qu.:3.000   1st Qu.:1.019e+09  
##  Median :325.0   Median :  2991   Median :3.500   Median :1.186e+09  
##  Mean   :326.1   Mean   : 19435   Mean   :3.502   Mean   :1.206e+09  
##  3rd Qu.:477.0   3rd Qu.:  8122   3rd Qu.:4.000   3rd Qu.:1.436e+09  
##  Max.   :610.0   Max.   :193609   Max.   :5.000   Max.   :1.538e+09

head(ratings_data)

##   userId movieId rating timestamp
## 1      1       1      4 964982703
## 2      1       3      4 964981247
## 3      1       6      4 964982224
## 4      1      47      5 964983815
## 5      1      50      5 964982931
## 6      1      70      3 964982400

summary(movies)

##     movieId                                          title     
##  Min.   :     1   Confessions of a Dangerous Mind (2002):   2  
##  1st Qu.:  3248   Emma (1996)                           :   2  
##  Median :  7300   Eros (2004)                           :   2  
##  Mean   : 42200   Saturn 3 (1980)                       :   2  
##  3rd Qu.: 76232   War of the Worlds (2005)              :   2  
##  Max.   :193609   ¡Three Amigos! (1986)                 :   1  
##                   (Other)                               :9731  
##             genres    
##  Drama         :1053  
##  Comedy        : 946  
##  Comedy|Drama  : 435  
##  Comedy|Romance: 363  
##  Drama|Romance : 349  
##  Documentary   : 339  
##  (Other)       :6257

head(movies)

##   movieId                              title
## 1       1                   Toy Story (1995)
## 2       2                     Jumanji (1995)
## 3       3            Grumpier Old Men (1995)
## 4       4           Waiting to Exhale (1995)
## 5       5 Father of the Bride Part II (1995)
## 6       6                        Heat (1995)
##                                        genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2                  Adventure|Children|Fantasy
## 3                              Comedy|Romance
## 4                        Comedy|Drama|Romance
## 5                                      Comedy
## 6                       Action|Crime|Thriller

Build a user-by-movie rating matrix (users as rows, movies as columns)

rating_mat <- spread(ratings_data[,1:3], movieId, rating)

rating_mat <- as.matrix(rating_mat[,-1]) #remove userIds
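
To show what this reshaping does, here is a small sketch on a made-up long-format table, using base R's tapply() instead of spread(). The toy data frame is an assumption for illustration; the result has the same shape as the matrix built above, with NA for unrated pairs.

```r
# Made-up ratings in the same long format as ratings.csv
toy <- data.frame(
  userId  = c(1, 1, 2, 3, 3),
  movieId = c(10, 20, 10, 20, 30),
  rating  = c(4.0, 3.5, 5.0, 2.0, 4.5)
)

# One row per user, one column per movie; unrated pairs become NA
toy_mat <- tapply(toy$rating, list(toy$userId, toy$movieId), mean)
toy_mat
```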

Convert into a recommenderlab sparse matrix

rating_mat <- as(rating_mat, "realRatingMatrix") 
image(rating_mat[1:100,1:100])

recommender_matrix <- recommenderRegistry$get_entries(dataType = "realRatingMatrix")
names(recommender_matrix)

##  [1] "HYBRID_realRatingMatrix"       "ALS_realRatingMatrix"         
##  [3] "ALS_implicit_realRatingMatrix" "IBCF_realRatingMatrix"        
##  [5] "LIBMF_realRatingMatrix"        "POPULAR_realRatingMatrix"     
##  [7] "RANDOM_realRatingMatrix"       "RERECOMMEND_realRatingMatrix" 
##  [9] "SVD_realRatingMatrix"          "SVDF_realRatingMatrix"        
## [11] "UBCF_realRatingMatrix"

lapply(recommender_matrix, "[[", "description")

## $HYBRID_realRatingMatrix
## [1] "Hybrid recommender that aggegates several recommendation strategies using weighted averages."
## 
## $ALS_realRatingMatrix
## [1] "Recommender for explicit ratings based on latent factors, calculated by alternating least squares algorithm."
## 
## $ALS_implicit_realRatingMatrix
## [1] "Recommender for implicit data based on latent factors, calculated by alternating least squares algorithm."
## 
## $IBCF_realRatingMatrix
## [1] "Recommender based on item-based collaborative filtering."
## 
## $LIBMF_realRatingMatrix
## [1] "Matrix factorization with LIBMF via package recosystem (https://cran.r-project.org/web/packages/recosystem/vignettes/introduction.html)."
## 
## $POPULAR_realRatingMatrix
## [1] "Recommender based on item popularity."
## 
## $RANDOM_realRatingMatrix
## [1] "Produce random recommendations (real ratings)."
## 
## $RERECOMMEND_realRatingMatrix
## [1] "Re-recommends highly rated items (real ratings)."
## 
## $SVD_realRatingMatrix
## [1] "Recommender based on SVD approximation with column-mean imputation."
## 
## $SVDF_realRatingMatrix
## [1] "Recommender based on Funk SVD with gradient descend (https://sifter.org/~simon/journal/20061211.html)."
## 
## $UBCF_realRatingMatrix
## [1] "Recommender based on user-based collaborative filtering."

SVD Parameters

recommender_matrix$SVD_realRatingMatrix$parameters

## $k
## [1] 10
## 
## $maxiter
## [1] 100
## 
## $normalize
## [1] "center"

Determine similarity between the first 4 users

similarity_users <- similarity(rating_mat[1:4, ], method = "cosine", which = "users")
as.matrix(similarity_users)

##           1  2         3         4
## 1 0.0000000  1 0.7919033 0.9328096
## 2 1.0000000  0        NA 1.0000000
## 3 0.7919033 NA 0.0000000 1.0000000
## 4 0.9328096  1 1.0000000 0.0000000

image(as.matrix(similarity_users), main = "User similarity")
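
The values of exactly 1 and NA in the table above are easier to read with the formula in hand. The sketch below computes plain cosine similarity over co-rated items only; cosine_corated is a hypothetical helper and the rating vectors are made up, so recommenderlab's internal handling may differ in detail.

```r
# Cosine similarity restricted to items that both users rated (hypothetical helper)
cosine_corated <- function(a, b) {
  both <- !is.na(a) & !is.na(b)      # items rated by both users
  if (!any(both)) return(NA_real_)   # no overlap: similarity undefined
  sum(a[both] * b[both]) /
    (sqrt(sum(a[both]^2)) * sqrt(sum(b[both]^2)))
}

u1 <- c(4, NA, 5, 3)
u2 <- c(NA, 2, 4, NA)   # shares only item 3 with u1
u3 <- c(NA, 5, NA, NA)  # shares nothing with u1

cosine_corated(u1, u2)  # a single co-rated item always gives 1
cosine_corated(u1, u3)  # no co-rated items gives NA
```

With only one co-rated item, two positive ratings always point in the same direction, so a similarity of 1 between sparse users says little about how alike they really are.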

Determine similarity between the first 4 movies

similarity_items <- similarity(rating_mat[, 1:4], method = "cosine", which ="items")
as.matrix(similarity_items)

##           1         2         3         4
## 1 0.0000000 0.9644641 0.9715415 0.9838699
## 2 0.9644641 0.0000000 0.9389013 0.9609877
## 3 0.9715415 0.9389013 0.0000000 1.0000000
## 4 0.9838699 0.9609877 1.0000000 0.0000000

image(as.matrix(similarity_items), main = "Movies similarity")

Explore ratings_data distribution

vector_ratings_data <- as.vector(rating_mat@data)
table(vector_ratings_data)

## vector_ratings_data
##       0     0.5       1     1.5       2     2.5       3     3.5       4     4.5 
## 5830804    1370    2811    1791    7551    5550   20047   13136   26818    8551 
##       5 
##   13211

unique(vector_ratings_data)

##  [1] 4.0 0.0 4.5 2.5 3.5 3.0 5.0 0.5 2.0 1.5 1.0
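
The value 0 in this table is not a rating (the lowest possible rating is 0.5); it marks user and movie pairs with no rating in the sparse matrix. A quick arithmetic sketch, using only the counts printed above, shows how sparse the matrix is:

```r
n_users   <- 610
n_ratings <- 100836               # sum of all non-zero counts in the table
n_missing <- 5830804              # count of the value 0 in the table

n_cells  <- n_ratings + n_missing # total cells in the user-by-movie matrix
n_movies <- n_cells / n_users     # number of columns (rated movies)
density  <- n_ratings / n_cells

n_movies
round(100 * density, 2)           # percent of cells that hold a rating
```

The matrix has fewer columns than movies.csv has rows, which indicates that some of the 9742 movies have no ratings at all.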

Explore movie performance

views_per_movie <- colCounts(rating_mat) # count views for each movie
table_views <- data.frame(movie = names(views_per_movie),views = views_per_movie)
table_views <- table_views[order(table_views$views, decreasing = TRUE), ] # sort by number of views
table_views$title <- NA
head(table_views)

##      movie views title
## 356    356   329    NA
## 318    318   317    NA
## 296    296   307    NA
## 593    593   279    NA
## 2571  2571   278    NA
## 260    260   251    NA

# Look up each title by movieId in one vectorized step instead of a row-by-row loop
table_views$title <- as.character(movies$title[match(as.character(table_views$movie), as.character(movies$movieId))])
head(table_views)

##      movie views                                     title
## 356    356   329                       Forrest Gump (1994)
## 318    318   317          Shawshank Redemption, The (1994)
## 296    296   307                       Pulp Fiction (1994)
## 593    593   279          Silence of the Lambs, The (1991)
## 2571  2571   278                        Matrix, The (1999)
## 260    260   251 Star Wars: Episode IV - A New Hope (1977)

Consider only movies with more than 50 views

average_ratings_data <- colMeans(rating_mat)
average_ratings_data_relevant <- average_ratings_data[views_per_movie > 50]

Only 436 movies have more than 50 views

Keep only users who have rated more than 50 movies and movies that have more than 50 views.

ratings_data_relevant <- rating_mat[rowCounts(rating_mat) > 50, colCounts(rating_mat) > 50]

ratings_data_relevant

## 378 x 436 rating matrix of class 'realRatingMatrix' with 36214 ratings.

vector_ratings_data_relevant <- as.vector(ratings_data_relevant@data)
table(vector_ratings_data_relevant)

## vector_ratings_data_relevant
##      0    0.5      1    1.5      2    2.5      3    3.5      4    4.5      5 
## 128594    322    694    367   1833   1479   6279   4605  10552   3742   6341

Defining Train and Test data sets

train_filter <- sample(x = c(TRUE, FALSE), size = nrow(ratings_data_relevant),replace = TRUE, prob = c(0.8, 0.2))

train_ratings_data <- as(ratings_data_relevant[train_filter, ], "realRatingMatrix") 
test_ratings_data <- as(ratings_data_relevant[!train_filter, ], "realRatingMatrix")
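
Because sample() is random, the split above changes on every run. A minimal sketch of a reproducible version follows; the seed value 612 is an arbitrary assumption, and 378 is the number of users in ratings_data_relevant reported earlier.

```r
set.seed(612)                       # arbitrary seed, chosen for illustration
n_users <- 378                      # rows of ratings_data_relevant

# Each user is assigned to train with probability 0.8, so sizes vary around 80/20
train_filter <- sample(c(TRUE, FALSE), size = n_users,
                       replace = TRUE, prob = c(0.8, 0.2))
table(train_filter)

# Alternative: draw exactly 80% of the row indices for training
train_idx <- sample(seq_len(n_users), size = floor(0.8 * n_users))
length(train_idx)
```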

Normalize data

train_ratings_data <- normalize(train_ratings_data)
test_ratings_data <- normalize(test_ratings_data)
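
By default, normalize() centers each user's ratings by row. The sketch below reproduces that idea in base R on a made-up matrix, assuming plain mean-centering over rated items only; it is an illustration, not recommenderlab's own code.

```r
# Two made-up users; NA marks unrated movies
m <- rbind(u1 = c(5, 3, NA, 4),
           u2 = c(2, NA, 1, 3))

row_means <- rowMeans(m, na.rm = TRUE)    # mean over rated items only
centered  <- sweep(m, 1, row_means, "-")  # subtract each user's mean

centered
rowMeans(centered, na.rm = TRUE)          # each row now averages to 0
```

Centering removes the difference between generous and strict raters, so the model works with how much a user liked a movie relative to that user's own average.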

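Create the recommender model, based on SVD approximation. Because the data were already normalized in the previous step, asking the model to center them again (normalize="center") triggers the warning below.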

recommender_model <- Recommender(train_ratings_data, method = "SVD", param=list(k=10,maxiter=100,normalize="center"))

## Warning in .local(x, ...): x was already normalized by row!

# Top 10 recommendations for each user in the test set
recom <- predict(recommender_model, newdata=test_ratings_data, n=10, type="topNList")

## Warning in .local(x, ...): x was already normalized by row!

recom_list <- as(recom, "list")

recom_result <- list()
for (i in c(1:10)){
 recom_result[[i]] <- movies[as.integer(recom_list[[i]]),2]
}
#library(knitr)
recom_result_df <- as.data.frame(recom_result)
colnames(recom_result_df) <- seq(1,10,1)
head(recom_result_df)

##                                                 1                          2
## 1                        Full Metal Jacket (1987)        Pat and Mike (1952)
## 2                                 Phantoms (1998)        Dr. Dolittle (1998)
## 3                             If Lucy Fell (1996)    Aristocrats, The (2005)
## 4                         Monster in a Box (1992) Usual Suspects, The (1995)
## 5 Haunted World of Edward D. Wood Jr., The (1996)      Jupiter's Wife (1994)
## 6                         Mighty Aphrodite (1995)       Pete's Dragon (1977)
##                                                     3
## 1                        Great White Hype, The (1996)
## 2 City Slickers II: The Legend of Curly's Gold (1994)
## 3                                 Dr. Dolittle (1998)
## 4                             Fierce Creatures (1997)
## 5                               Kiss Me, Guido (1997)
## 6                                   Sgt. Bilko (1996)
##                                                   4
## 1                                   Fog, The (1980)
## 2 Star Wars: Episode VI - Return of the Jedi (1983)
## 3                                   Out Cold (2001)
## 4                             Kiss the Girls (1997)
## 5                           Dead Man Walking (1995)
## 6                                  Dobermann (1997)
##                              5
## 1            Wonderland (1999)
## 2 Welcome to Collinwood (2002)
## 3              Cop Land (1997)
## 4         Flesh & Blood (1985)
## 5            Cat People (1982)
## 6                Kissed (1996)
##                                                     6
## 1                   NeverEnding Story III, The (1994)
## 2                        Great White Hype, The (1996)
## 3 City Slickers II: The Legend of Curly's Gold (1994)
## 4      Bread and Chocolate (Pane e cioccolata) (1973)
## 5                       Angels in the Outfield (1994)
## 6                        Age of Innocence, The (1993)
##                                             7
## 1 Time Masters (Maîtres du temps, Les) (1982)
## 2                       Kiss Me, Guido (1997)
## 3        Kid in King Arthur's Court, A (1995)
## 4                               Snatch (2000)
## 5                Whole Wide World, The (1996)
## 6                   Twelve Chairs, The (1970)
##                                             8
## 1 Time Masters (Maîtres du temps, Les) (1982)
## 2                             Monsters (2010)
## 3                 Vanya on 42nd Street (1994)
## 4                    Conspiracy Theory (1997)
## 5                Welcome to Collinwood (2002)
## 6                                        <NA>
##                                                9
## 1             What's Eating Gilbert Grape (1993)
## 2                    Addams Family Values (1993)
## 3                              Wonderland (1999)
## 4 Bread and Chocolate (Pane e cioccolata) (1973)
## 5                   Great White Hype, The (1996)
## 6              Nightmare on Elm Street, A (1984)
##                                10
## 1    Welcome to Collinwood (2002)
## 2 Children of a Lesser God (1986)
## 3              Thumbsucker (2005)
## 4                Dobermann (1997)
## 5           Kiss the Girls (1997)
## 6                   Casino (1995)
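
One caveat about the lookup used above: the recommendation list holds item labels, which are movieId values, while movies[as.integer(...), 2] selects rows by position. Since movieId values have gaps, position and movieId do not generally agree, which may explain the unexpected titles and the NA entry. A minimal sketch of looking titles up by movieId instead, on a small made-up subset of movies.csv with a hypothetical recommendation vector:

```r
# Small subset of movies.csv; note the gaps in movieId
movies_toy <- data.frame(
  movieId = c(1, 3, 6, 296, 356),
  title   = c("Toy Story (1995)", "Grumpier Old Men (1995)", "Heat (1995)",
              "Pulp Fiction (1994)", "Forrest Gump (1994)"),
  stringsAsFactors = FALSE
)

recommended <- c("3", "356")   # hypothetical item labels (movieId as text)

# Indexing by row position: row 3 is a different movie, row 356 does not exist
by_position <- movies_toy$title[as.integer(recommended)]

# Matching on movieId returns the intended titles
by_id <- movies_toy$title[match(as.integer(recommended), movies_toy$movieId)]

by_position
by_id
```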

Predicted ratings for the movies

recomr <- predict(recommender_model, newdata=test_ratings_data,  type="ratingMatrix")

## Warning in .local(x, ...): x was already normalized by row!

recomr_mat <- as(recomr, "matrix")
recomr_mat[1:5,1:5] #First 5 users and first 5 movies

##             1        2        3        6        7
## [1,] 3.784563 3.880402 3.845442 3.731974 3.833721
## [2,] 3.603780 3.672544 3.656996 3.611870 3.644504
## [3,] 4.305907 4.323836 4.323536 4.297794 4.327453
## [4,] 3.255318 3.066650 3.199115 3.280091 3.095823
## [5,] 3.767017 3.527568 3.620133 3.955714 3.620279

Evaluating Model with Cross-validation

eval_sch <- evaluationScheme(ratings_data_relevant, method="cross-validation", k=4, given=10, goodRating=3)
recommender_model <- Recommender(getData(eval_sch,"train"), method = "SVD", param=list(k=10,maxiter=100,normalize="center"))
recomcv <- predict(recommender_model, newdata=getData(eval_sch,"known"), n=10, type="topNList")

Performance index of the whole model

eval_accuracy <- calcPredictionAccuracy(x = recomcv, data = getData(eval_sch, "unknown"), given=10, goodRating=3, byUser = FALSE)
head(eval_accuracy)

##           TP           FP           FN           TN    precision       recall 
##   3.56250000   6.43750000  84.38541667 331.61458333   0.35625000   0.05086994
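
The reported precision can be checked directly from the averaged counts. The sketch below uses only the numbers printed above. The pooled recall it computes is lower than the reported recall, which would be consistent with recall being averaged per user rather than computed from the averaged counts; that reading is an assumption.

```r
TP <- 3.56250000
FP <- 6.43750000
FN <- 84.38541667

# Every user receives 10 recommendations, so TP + FP is 10
precision <- TP / (TP + FP)
precision                      # matches the reported 0.35625

# Ratio of averaged counts; not the same as the average of per-user recalls
pooled_recall <- TP / (TP + FN)
pooled_recall                  # about 0.0405, versus the reported 0.0509
```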

Evaluate the recommender model for different numbers of movies recommended to each user (0 to 20 in steps of 5)

results <- evaluate(x = eval_sch,method = "SVD",n = seq(0,20,5))

## SVD run fold/sample [model time/prediction time]
##   1  [0.02sec/0.051sec] 
##   2  [0.028sec/0.077sec] 
##   3  [0.028sec/0.291sec] 
##   4  [0.016sec/0.048sec]

head(getConfusionMatrix(results)[[1]])

##          TP        FP       FN       TN precision     recall        TPR
## 0  0.000000  0.000000 87.94792 338.0521       NaN 0.00000000 0.00000000
## 5  2.072917  2.927083 85.87500 335.1250 0.4145833 0.03051042 0.03051042
## 10 3.562500  6.437500 84.38542 331.6146 0.3562500 0.05086994 0.05086994
## 15 4.958333 10.041667 82.98958 328.0104 0.3305556 0.06710692 0.06710692
## 20 6.635417 13.364583 81.31250 324.6875 0.3317708 0.08694744 0.08694744
##            FPR
## 0  0.000000000
## 5  0.008311544
## 10 0.018409196
## 15 0.028867495
## 20 0.038470050
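
A few sanity checks on this table, sketched with the printed values copied by hand: TP + FP should equal the number of recommended movies, TP + FN should stay constant across rows, and all four counts should add up to 426, which is consistent with the 436 movies minus the 10 given to the model per user. The column names added here are illustrative.

```r
# Values copied from the confusion matrix above (rows n = 5, 10, 15, 20)
cm <- data.frame(
  n  = c(5, 10, 15, 20),
  TP = c(2.072917, 3.562500, 4.958333, 6.635417),
  FP = c(2.927083, 6.437500, 10.041667, 13.364583),
  FN = c(85.87500, 84.38542, 82.98958, 81.31250),
  TN = c(335.1250, 331.6146, 328.0104, 324.6875)
)

cm$recommended <- cm$TP + cm$FP                # equals n
cm$relevant    <- cm$TP + cm$FN                # constant across rows
cm$total       <- with(cm, TP + FP + FN + TN)  # 436 movies minus 10 given

cm[, c("n", "recommended", "relevant", "total")]
```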

data-612-Project3

Ashish Kumar

06/24/2020