Project 3
Your task is to implement a matrix factorization method, such as singular value decomposition (SVD) or alternating least squares (ALS), in the context of a recommender system. You may approach this assignment in a number of ways. You are welcome to start with an existing recommender system written by yourself or someone else. Remember, as always, to cite your sources, so that you can be graded on what you added, not what you found. SVD can be thought of as a pre-processing step for feature engineering. You might easily start with thousands or millions of items, and use SVD to create a much smaller set of “k” items (e.g. 20 or 70).
I chose the ml-latest-small dataset from MovieLens. It describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 100836 ratings and 3683 tag applications across 9742 movies. These data were created by 610 users between March 29, 1996 and September 24, 2018. The dataset was generated on September 26, 2018.
Ratings Data File Structure (ratings.csv)
All ratings are contained in the file ratings.csv. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:
userId,movieId,rating,timestamp
Movies Data File Structure (movies.csv)
Movie information is contained in the file movies.csv. Each line of this file after the header row represents one movie, and has the following format:
movieId,title,genres
Download Data
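# Download and extract with base R; note that `unzip -f` only freshens files that already exist
download.file("http://files.grouplens.org/datasets/movielens/ml-latest-small.zip", destfile = "ml-latest-small.zip", mode = "wb")
unzip("ml-latest-small.zip")  # extracts into ml-latest-small/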
movies <- read.csv('ml-latest-small/movies.csv')
ratings_data <- read.csv('ml-latest-small/ratings.csv')
#library(dplyr)
library(tidyr)
library(recommenderlab)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loading required package: arules
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
## Loading required package: proxy
##
## Attaching package: 'proxy'
## The following object is masked from 'package:Matrix':
##
## as.matrix
## The following objects are masked from 'package:stats':
##
## as.dist, dist
## The following object is masked from 'package:base':
##
## as.matrix
## Loading required package: registry
## Registered S3 methods overwritten by 'registry':
## method from
## print.registry_field proxy
## print.registry_entry proxy
Data Exploration
## userId movieId rating timestamp
## Min. : 1.0 Min. : 1 Min. :0.500 Min. :8.281e+08
## 1st Qu.:177.0 1st Qu.: 1199 1st Qu.:3.000 1st Qu.:1.019e+09
## Median :325.0 Median : 2991 Median :3.500 Median :1.186e+09
## Mean :326.1 Mean : 19435 Mean :3.502 Mean :1.206e+09
## 3rd Qu.:477.0 3rd Qu.: 8122 3rd Qu.:4.000 3rd Qu.:1.436e+09
## Max. :610.0 Max. :193609 Max. :5.000 Max. :1.538e+09
## userId movieId rating timestamp
## 1 1 1 4 964982703
## 2 1 3 4 964981247
## 3 1 6 4 964982224
## 4 1 47 5 964983815
## 5 1 50 5 964982931
## 6 1 70 3 964982400
##     movieId                                         title                genres
##  Min.   :     1   Confessions of a Dangerous Mind (2002):   2   Drama         :1053
##  1st Qu.:  3248   Emma (1996)                           :   2   Comedy        : 946
##  Median :  7300   Eros (2004)                           :   2   Comedy|Drama  : 435
##  Mean   : 42200   Saturn 3 (1980)                       :   2   Comedy|Romance: 363
##  3rd Qu.: 76232   War of the Worlds (2005)              :   2   Drama|Romance : 349
##  Max.   :193609   ¡Three Amigos! (1986)                 :   1   Documentary   : 339
##                   (Other)                               :9731   (Other)       :6257
##   movieId                              title                                      genres
## 1       1                   Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
## 2       2                     Jumanji (1995)                  Adventure|Children|Fantasy
## 3       3            Grumpier Old Men (1995)                              Comedy|Romance
## 4       4           Waiting to Exhale (1995)                        Comedy|Drama|Romance
## 5       5 Father of the Bride Part II (1995)                                      Comedy
## 6       6                        Heat (1995)                       Action|Crime|Thriller
Build a user matrix with movies as columns
Convert into a recommenderlab sparse matrix
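The code for these two steps is not shown in the report. As a minimal sketch, assuming toy data in the same userId,movieId,rating layout as ratings.csv, the long table can be pivoted into a user-by-movie matrix with base R. The names `toy_ratings`, `movie_ids` and `wide` are hypothetical. In the report, the result would presumably then be coerced with recommenderlab's `as(wide, "realRatingMatrix")`, which treats NA as "not rated".

```r
# Toy ratings in the long format of ratings.csv (hypothetical values)
toy_ratings <- data.frame(
  userId  = c(1, 1, 2, 3, 3),
  movieId = c(10, 20, 10, 20, 30),
  rating  = c(4.0, 3.5, 5.0, 2.0, 4.5)
)

users     <- sort(unique(toy_ratings$userId))
movie_ids <- sort(unique(toy_ratings$movieId))

# One row per user, one column per movie; unrated cells stay NA
wide <- matrix(NA_real_, nrow = length(users), ncol = length(movie_ids),
               dimnames = list(as.character(users), as.character(movie_ids)))

# Fill each (user, movie) cell using a two-column index matrix
wide[cbind(match(toy_ratings$userId, users),
           match(toy_ratings$movieId, movie_ids))] <- toy_ratings$rating

print(wide)
```

Keeping missing ratings as NA rather than 0 matters here, because a 0 would otherwise be read as a very low rating.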
recommender_matrix <- recommenderRegistry$get_entries(dataType = "realRatingMatrix")
names(recommender_matrix)
## [1] "HYBRID_realRatingMatrix" "ALS_realRatingMatrix"
## [3] "ALS_implicit_realRatingMatrix" "IBCF_realRatingMatrix"
## [5] "LIBMF_realRatingMatrix" "POPULAR_realRatingMatrix"
## [7] "RANDOM_realRatingMatrix" "RERECOMMEND_realRatingMatrix"
## [9] "SVD_realRatingMatrix" "SVDF_realRatingMatrix"
## [11] "UBCF_realRatingMatrix"
## $HYBRID_realRatingMatrix
## [1] "Hybrid recommender that aggegates several recommendation strategies using weighted averages."
##
## $ALS_realRatingMatrix
## [1] "Recommender for explicit ratings based on latent factors, calculated by alternating least squares algorithm."
##
## $ALS_implicit_realRatingMatrix
## [1] "Recommender for implicit data based on latent factors, calculated by alternating least squares algorithm."
##
## $IBCF_realRatingMatrix
## [1] "Recommender based on item-based collaborative filtering."
##
## $LIBMF_realRatingMatrix
## [1] "Matrix factorization with LIBMF via package recosystem (https://cran.r-project.org/web/packages/recosystem/vignettes/introduction.html)."
##
## $POPULAR_realRatingMatrix
## [1] "Recommender based on item popularity."
##
## $RANDOM_realRatingMatrix
## [1] "Produce random recommendations (real ratings)."
##
## $RERECOMMEND_realRatingMatrix
## [1] "Re-recommends highly rated items (real ratings)."
##
## $SVD_realRatingMatrix
## [1] "Recommender based on SVD approximation with column-mean imputation."
##
## $SVDF_realRatingMatrix
## [1] "Recommender based on Funk SVD with gradient descend (https://sifter.org/~simon/journal/20061211.html)."
##
## $UBCF_realRatingMatrix
## [1] "Recommender based on user-based collaborative filtering."
SVD Parameters
## $k
## [1] 10
##
## $maxiter
## [1] 100
##
## $normalize
## [1] "center"
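To illustrate what `k` and `normalize = "center"` control, here is a minimal base R sketch of a rank-k SVD approximation with column-mean imputation, the idea named in the registry description above. It is not recommenderlab's implementation: the toy matrix, the order of centering and imputation, and all variable names are assumptions, and `maxiter` is not modelled.

```r
# Toy 6 x 5 user-by-movie matrix with missing ratings (hypothetical data)
R <- rbind(
  c( 5,  4, NA,  1, NA),
  c( 4, NA,  5,  1,  2),
  c(NA,  5,  4,  2,  1),
  c( 1,  2, NA,  5,  4),
  c( 2,  1,  2, NA,  5),
  c( 1, NA,  1,  4,  5)
)

# "center": subtract each user's mean rating (row i gets row_means[i] removed)
row_means <- rowMeans(R, na.rm = TRUE)
centered  <- R - row_means

# Column-mean imputation: fill each missing cell with its movie's mean
col_means <- colMeans(centered, na.rm = TRUE)
filled    <- centered
missing   <- is.na(filled)
filled[missing] <- col_means[col(filled)[missing]]

# Keep only the k largest singular values and their vectors
k <- 2
s <- svd(filled)
approx_k <- s$u[, 1:k, drop = FALSE] %*%
  diag(s$d[1:k], nrow = k) %*%
  t(s$v[, 1:k, drop = FALSE])

# Add the user means back to return to the rating scale
predicted <- approx_k + row_means
round(predicted, 2)
```

A smaller `k` gives a coarser approximation; with `k` equal to the number of columns the filled matrix is reproduced exactly.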
Determine similarity between users (first 4 users)
similarity_users <- similarity(rating_mat[1:4, ], method = "cosine", which = "users")
as.matrix(similarity_users)
## 1 2 3 4
## 1 0.0000000 1 0.7919033 0.9328096
## 2 1.0000000 0 NA 1.0000000
## 3 0.7919033 NA 0.0000000 1.0000000
## 4 0.9328096 1 1.0000000 0.0000000
Determine similarity between items (first 4 movies)
similarity_items <- similarity(rating_mat[, 1:4], method = "cosine", which = "items")
as.matrix(similarity_items)
## 1 2 3 4
## 1 0.0000000 0.9644641 0.9715415 0.9838699
## 2 0.9644641 0.0000000 0.9389013 0.9609877
## 3 0.9715415 0.9389013 0.0000000 1.0000000
## 4 0.9838699 0.9609877 1.0000000 0.0000000
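The NA and the exact 1.0 values in the user matrix above are consistent with cosine similarity being computed only over the movies both users have rated. The following is a sketch of that convention in base R, not a reproduction of recommenderlab's code; `cosine_corated` and the toy vectors are hypothetical.

```r
# Cosine similarity restricted to co-rated items (NA = not rated)
cosine_corated <- function(x, y) {
  both <- !is.na(x) & !is.na(y)
  if (!any(both)) return(NA_real_)  # no overlap: similarity is undefined
  sum(x[both] * y[both]) /
    (sqrt(sum(x[both]^2)) * sqrt(sum(y[both]^2)))
}

u1 <- c(4, NA, 5, 3)
u2 <- c(NA, 2, NA, 4)   # shares only the 4th movie with u1
u3 <- c(NA, 5, NA, NA)  # shares no movie with u1
u4 <- c(5, 1, 4, NA)    # shares the 1st and 3rd movies with u1

cosine_corated(u1, u2)  # 1: a single co-rated movie always gives 1
cosine_corated(u1, u3)  # NA: nothing in common
cosine_corated(u1, u4)
```

With a single co-rated movie the two vectors are trivially parallel, which is why similarities of exactly 1 between sparse users say little about real agreement.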
Explore ratings_data distribution
## vector_ratings_data
##       0     0.5       1     1.5       2     2.5       3     3.5       4     4.5       5
## 5830804    1370    2811    1791    7551    5550   20047   13136   26818    8551   13211
## [1] 4.0 0.0 4.5 2.5 3.5 3.0 5.0 0.5 2.0 1.5 1.0
The 0 entries are not ratings: they are the user-movie pairs with no rating in the sparse matrix (5830804 of them, against the 100836 actual ratings), so the real rating scale runs from 0.5 to 5.
Explore movie performance
views_per_movie <- colCounts(rating_mat) # count views for each movie
table_views <- data.frame(movie = names(views_per_movie),views = views_per_movie)
table_views <- table_views[order(table_views$views, decreasing = TRUE), ] # sort by number of views
table_views$title <- NA
head(table_views)
## movie views title
## 356 356 329 NA
## 318 318 317 NA
## 296 296 307 NA
## 593 593 279 NA
## 2571 2571 278 NA
## 260 260 251 NA
# Look up every title by movieId in one vectorized step
table_views$title <- as.character(movies$title[match(table_views$movie, movies$movieId)])
head(table_views)
## movie views title
## 356 356 329 Forrest Gump (1994)
## 318 318 317 Shawshank Redemption, The (1994)
## 296 296 307 Pulp Fiction (1994)
## 593 593 279 Silence of the Lambs, The (1991)
## 2571 2571 278 Matrix, The (1999)
## 260 260 251 Star Wars: Episode IV - A New Hope (1977)
Consider only movies with more than 50 views
average_ratings_data <- colMeans(rating_mat)
average_ratings_data_relevant <- average_ratings_data[views_per_movie > 50]
Only 436 movies have more than 50 views.
Keep only users who have rated more than 50 movies and movies that have been rated more than 50 times.
ratings_data_relevant <- rating_mat[rowCounts(rating_mat) > 50, colCounts(rating_mat) > 50]
ratings_data_relevant
## 378 x 436 rating matrix of class 'realRatingMatrix' with 36214 ratings.
vector_ratings_data_relevant <- as.vector(ratings_data_relevant@data)
table(vector_ratings_data_relevant)
## vector_ratings_data_relevant
## 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5
## 128594 322 694 367 1833 1479 6279 4605 10552 3742 6341
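`rowCounts` and `colCounts` count the ratings per user and per movie. The same filter can be sketched in base R on a toy matrix with NA for missing ratings; the data and the threshold `min_count` are hypothetical.

```r
# Toy user-by-movie matrix (NA = not rated)
m <- rbind(
  u1 = c( 5,  4, 3, NA),
  u2 = c( 4, NA, 2, NA),
  u3 = c(NA, NA, 1, NA),
  u4 = c( 3,  5, 4,  2)
)
colnames(m) <- c("m1", "m2", "m3", "m4")

min_count <- 1  # hypothetical threshold; the report uses 50

user_counts  <- rowSums(!is.na(m))  # ratings given by each user
movie_counts <- colSums(!is.na(m))  # ratings received by each movie

# Keep users and movies with strictly more than min_count ratings
kept <- m[user_counts > min_count, movie_counts > min_count, drop = FALSE]
print(kept)
```

As in the report's filter, both counts are taken on the original matrix, so a kept user can end up with fewer ratings than the threshold once the sparse movies are dropped.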
Defining Train and Test data sets
Normalize data
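The split and normalization code is not shown in the report. Below is a minimal sketch in base R under two stated assumptions: an 80/20 split by user (the actual proportion is not given) and row centering, in line with the "center" setting listed under SVD Parameters. All names and data are hypothetical. The repeated warning `x was already normalized by row!` in the output is what one would expect if the data are normalized here and then again inside the model, though the report does not show that code.

```r
set.seed(123)

# Toy user-by-movie matrix (NA = not rated)
m <- matrix(c( 5,  3, NA, 1,
               4, NA,  2, 2,
              NA,  5,  4, 3,
               2,  2, NA, 5,
               1, NA,  3, 4),
            nrow = 5, byrow = TRUE,
            dimnames = list(paste0("u", 1:5), paste0("m", 1:4)))

# Assumed 80/20 split: sample 80% of the users for training
train_idx <- sample(nrow(m), size = floor(0.8 * nrow(m)))
train <- m[train_idx, , drop = FALSE]
test  <- m[-train_idx, , drop = FALSE]

# Row centering: subtract each user's own mean rating
center_rows <- function(x) x - rowMeans(x, na.rm = TRUE)
train_centered <- center_rows(train)

round(train_centered, 2)
```

After centering, each user's ratings average to zero, so a generous and a strict rater with the same preferences look alike to the model.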
Create the recommender model, based on SVD approximation
## Warning in .local(x, ...): x was already normalized by row!
# Top 10 recommendations for users (1-10)
recom <- predict(recommender_model, newdata=test_ratings_data, n=10, type="topNList")
## Warning in .local(x, ...): x was already normalized by row!
recom_list <- as(recom, "list")
recom_result <- list()
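for (i in 1:10) {
  # Item labels are movieIds, so look titles up by movieId rather than by row position in movies
  recom_result[[i]] <- as.character(movies$title[match(recom_list[[i]], movies$movieId)])
}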
#library(knitr)
recom_result_df <- as.data.frame(recom_result)
colnames(recom_result_df) <- seq(1,10,1)
head(recom_result_df)
## 1 2
## 1 Full Metal Jacket (1987) Pat and Mike (1952)
## 2 Phantoms (1998) Dr. Dolittle (1998)
## 3 If Lucy Fell (1996) Aristocrats, The (2005)
## 4 Monster in a Box (1992) Usual Suspects, The (1995)
## 5 Haunted World of Edward D. Wood Jr., The (1996) Jupiter's Wife (1994)
## 6 Mighty Aphrodite (1995) Pete's Dragon (1977)
## 3
## 1 Great White Hype, The (1996)
## 2 City Slickers II: The Legend of Curly's Gold (1994)
## 3 Dr. Dolittle (1998)
## 4 Fierce Creatures (1997)
## 5 Kiss Me, Guido (1997)
## 6 Sgt. Bilko (1996)
## 4
## 1 Fog, The (1980)
## 2 Star Wars: Episode VI - Return of the Jedi (1983)
## 3 Out Cold (2001)
## 4 Kiss the Girls (1997)
## 5 Dead Man Walking (1995)
## 6 Dobermann (1997)
## 5
## 1 Wonderland (1999)
## 2 Welcome to Collinwood (2002)
## 3 Cop Land (1997)
## 4 Flesh & Blood (1985)
## 5 Cat People (1982)
## 6 Kissed (1996)
## 6
## 1 NeverEnding Story III, The (1994)
## 2 Great White Hype, The (1996)
## 3 City Slickers II: The Legend of Curly's Gold (1994)
## 4 Bread and Chocolate (Pane e cioccolata) (1973)
## 5 Angels in the Outfield (1994)
## 6 Age of Innocence, The (1993)
## 7
## 1 Time Masters (Maîtres du temps, Les) (1982)
## 2 Kiss Me, Guido (1997)
## 3 Kid in King Arthur's Court, A (1995)
## 4 Snatch (2000)
## 5 Whole Wide World, The (1996)
## 6 Twelve Chairs, The (1970)
## 8
## 1 Time Masters (Maîtres du temps, Les) (1982)
## 2 Monsters (2010)
## 3 Vanya on 42nd Street (1994)
## 4 Conspiracy Theory (1997)
## 5 Welcome to Collinwood (2002)
## 6 <NA>
## 9
## 1 What's Eating Gilbert Grape (1993)
## 2 Addams Family Values (1993)
## 3 Wonderland (1999)
## 4 Bread and Chocolate (Pane e cioccolata) (1973)
## 5 Great White Hype, The (1996)
## 6 Nightmare on Elm Street, A (1984)
## 10
## 1 Welcome to Collinwood (2002)
## 2 Children of a Lesser God (1986)
## 3 Thumbsucker (2005)
## 4 Dobermann (1997)
## 5 Kiss the Girls (1997)
## 6 Casino (1995)
Predicted ratings for the movies
## Warning in .local(x, ...): x was already normalized by row!
## 1 2 3 6 7
## [1,] 3.784563 3.880402 3.845442 3.731974 3.833721
## [2,] 3.603780 3.672544 3.656996 3.611870 3.644504
## [3,] 4.305907 4.323836 4.323536 4.297794 4.327453
## [4,] 3.255318 3.066650 3.199115 3.280091 3.095823
## [5,] 3.767017 3.527568 3.620133 3.955714 3.620279
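A top-N list follows from predicted ratings like these by ranking, for each user, the movies that user has not yet rated. The sketch below shows that step in base R with hypothetical predictions; `pred`, `known` and `top_n` are illustrative names, not objects from the report.

```r
# Hypothetical predicted ratings (users in rows, movies in columns)
pred <- rbind(
  u1 = c(m1 = 3.8, m2 = 4.4, m3 = 3.1, m4 = 4.0),
  u2 = c(m1 = 4.1, m2 = 3.6, m3 = 4.2, m4 = 3.3)
)

# TRUE where the user has already rated the movie
known <- rbind(
  u1 = c(TRUE,  FALSE, FALSE, FALSE),
  u2 = c(FALSE, FALSE, TRUE,  FALSE)
)

# Rank the unrated movies by predicted rating and keep the first n
top_n <- function(pred_row, known_row, n) {
  candidates <- pred_row[!known_row]
  ranked <- names(sort(candidates, decreasing = TRUE))
  ranked[seq_len(min(n, length(ranked)))]
}

top2 <- lapply(rownames(pred), function(u) top_n(pred[u, ], known[u, ], n = 2))
names(top2) <- rownames(pred)
top2
```

The names returned are item labels (movieIds in the report's matrix), so titles have to be looked up by movieId afterwards.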
Evaluating Model with Cross-validation
eval_sch <- evaluationScheme(ratings_data_relevant, method="cross-validation", k=4, given=10, goodRating=3)
recommender_model <- Recommender(getData(eval_sch,"train"), method = "SVD", param=list(k=10,maxiter=100,normalize="center"))
recomcv <- predict(recommender_model, newdata=getData(eval_sch,"known"), n=10, type="topNList")
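In this scheme `k = 4` is the number of cross-validation folds (not the SVD rank `k = 10`), and `given = 10` is the number of ratings per test user shown to the model, with the rest withheld for scoring. The sketch below shows one fold of that idea in base R on toy data with `given = 2`. It is an illustration under those assumptions, not recommenderlab's code.

```r
set.seed(42)

n_users <- 8
n_items <- 6
ratings <- matrix(sample(1:5, n_users * n_items, replace = TRUE),
                  nrow = n_users)

# Assign each user to one of 4 folds at random
folds <- sample(rep(1:4, length.out = n_users))
given <- 2  # ratings per test user that the model may see

# Fold 1 is the test fold; the other users form the training set
test_users <- which(folds == 1)
train <- ratings[-test_users, , drop = FALSE]

# Split each test user's ratings into a "known" and an "unknown" part
known <- unknown <- ratings[test_users, , drop = FALSE]
for (i in seq_len(nrow(known))) {
  rated <- which(!is.na(known[i, ]))
  keep  <- sample(rated, given)
  known[i, -keep]  <- NA  # the model sees only the kept ratings
  unknown[i, keep] <- NA  # the remaining ratings are used for scoring
}

known
unknown
```

Repeating this with each fold in turn as the test set gives the four runs reported in the evaluation output.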
Performance index of the whole model
eval_accuracy <- calcPredictionAccuracy(x = recomcv, data = getData(eval_sch, "unknown"), given=10, goodRating=3, byUser = FALSE)
head(eval_accuracy)
## TP FP FN TN precision recall
## 3.56250000 6.43750000 84.38541667 331.61458333 0.35625000 0.05086994
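The figures above can be checked by hand. Precision is TP / (TP + FP), which reproduces the reported 0.35625. The reported recall (0.0509), however, is not TP / (TP + FN) of the averaged counts (about 0.0405). That is consistent with the metrics being computed per user and then averaged, which is an inference from the numbers rather than something the report states. The toy vectors `tp` and `fn` below are hypothetical.

```r
# Averaged confusion counts taken from the output above
TP <- 3.5625
FP <- 6.4375
FN <- 84.38541667

precision_from_means <- TP / (TP + FP)  # 0.35625, matches the report
recall_from_means    <- TP / (TP + FN)  # about 0.0405, not the reported 0.0509

# Toy example: averaging per-user recall differs from
# computing recall on averaged counts
tp <- c(1, 4)   # true positives for two hypothetical users
fn <- c(1, 36)  # false negatives for the same users

mean_of_recalls <- mean(tp / (tp + fn))                # (0.5 + 0.1) / 2
recall_of_means <- mean(tp) / (mean(tp) + mean(fn))    # 2.5 / 21

c(precision_from_means, recall_from_means, mean_of_recalls, recall_of_means)
```

Precision is unaffected because every user receives exactly 10 recommendations, so TP + FP is the same for all users.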
Evaluate the recommender model for different numbers of movies recommended to each user (0 to 20 in steps of 5)
## SVD run fold/sample [model time/prediction time]
## 1 [0.02sec/0.051sec]
## 2 [0.028sec/0.077sec]
## 3 [0.028sec/0.291sec]
## 4 [0.016sec/0.048sec]
##          TP        FP       FN       TN precision     recall        TPR         FPR
## 0  0.000000  0.000000 87.94792 338.0521       NaN 0.00000000 0.00000000 0.000000000
## 5  2.072917  2.927083 85.87500 335.1250 0.4145833 0.03051042 0.03051042 0.008311544
## 10 3.562500  6.437500 84.38542 331.6146 0.3562500 0.05086994 0.05086994 0.018409196
## 15 4.958333 10.041667 82.98958 328.0104 0.3305556 0.06710692 0.06710692 0.028867495
## 20 6.635417 13.364583 81.31250 324.6875 0.3317708 0.08694744 0.08694744 0.038470050