library(tidyverse)
library(splitstackshape)
library(recommenderlab)
This project extends and builds upon the previous submission, which constructed a recommender system example with MovieLens data from grouplens.org. Although MovieLens is available within the recommenderlab package, I opted to manually prepare the data before moving to recommenderlab for analysis. Many steps using recommenderlab follow Chapter 3 of Building a Recommendation System with R by S. Gorakala and M. Usuelli.
After reading in the data locally, I review the head of the ratings data frame, which contains the core of the data used for this assignment.
ratings <- read.csv("ratings.csv", header = T)
links <- read.csv("links.csv", header = T, stringsAsFactors = F)
movies <- read.csv("movies.csv", header = T, stringsAsFactors = F) %>%
left_join(links, by = "movieId") %>% select(-tmdbId)
# object.size(ratings)
head(ratings)
## userId movieId rating timestamp
## 1 1 2 3.5 1112486027
## 2 1 29 3.5 1112484676
## 3 1 32 3.5 1112484819
## 4 1 47 3.5 1112484727
## 5 1 50 3.5 1112484580
## 6 1 112 3.5 1094785740
Next, I’ll arrange the data by userId and review a histogram of the average user rating.
users <- ratings %>% select(userId, rating) %>% group_by(userId) %>%
summarise(Qty_Rated = n(),
Avg_rating = mean(rating),
Median_rating = median(rating)) %>%
arrange(desc(Qty_Rated))
users %>%
ggplot(aes(x = Avg_rating))+ geom_histogram(bins = 40)+
ggtitle("Ratings: User-averages") + theme_classic()
As shown, the histogram summarizes how average ratings vary from user to user.
Next, I’ll review the same histogram from the perspective of movies:
# I'll take out movies that are not rated.
items <- ratings %>% select(movieId, rating) %>% filter(rating > 0) %>%
group_by(movieId) %>%
summarise(Qty_Rated = n(),
Avg_rating = mean(rating),
Median_rating = median(rating)) %>%
arrange(desc(Qty_Rated))
items %>%
ggplot(aes(x = Avg_rating))+ geom_histogram(bins = 20) +
ggtitle("Ratings: Movie-averages") + theme_classic()
Next, I’ll review the movie data, which contains 27K+ movie listings, and count titles by genre. This requires binarizing the genre strings using splitstackshape’s cSplit_e function:
# Use splitstackshape's cSplit_e to create multiple indicator
# columns from a pipe-separated genre string column
g_movies <- movies %>% mutate("g" = genres) %>% select(-genres) %>%
cSplit_e("g", sep = "|", type = "character", fill = 0, drop = F)
g_movies %>%
gather(-c(movieId, title, imdbId, g),
key = "genre",
value = "m_count") %>%
ggplot(aes(x = genre, y = m_count)) +
geom_bar(stat = "identity") +
coord_flip()
As shown above, g_Drama (movies that fall, at least in part, into the Drama category) is the most common genre, followed by g_Comedy.
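To back the chart with numbers, the same indicator columns can be tallied and sorted directly. This is a minimal sketch that assumes the g_* indicator columns created by cSplit_e above:
# Count titles per genre using the 0/1 indicator columns
g_movies %>%
select(starts_with("g_")) %>%
summarise_all(sum) %>%
gather(key = "genre", value = "m_count") %>%
arrange(desc(m_count)) %>%
head(5)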
Using a baseline minimum of 2,500 ratings for both users and movies, I’ll narrow down the user-item pairs to reduce the size of the data I’m working with.
set.seed(1)
# get the list of users that have rated more than 2,500 movies
# and the list of movies (items) that have more than 2,500 ratings
my_users <- users %>% filter(Qty_Rated > 2500)
my_items <- items %>% filter(Qty_Rated > 2500)
my_ratings <- ratings %>%
filter(userId %in% my_users$userId,
movieId %in% my_items$movieId) %>%
mutate(userId = as.factor(userId),
movieId = as.factor(movieId)) %>%
select(-timestamp)
# xval contains the train and test data 70/30 split
xval <- evaluationScheme(as(my_ratings, "realRatingMatrix"),
method = "split",
train = 0.7,
given = 3,
goodRating = 5)
xval
## Evaluation scheme with 3 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.700
## Good ratings: >=5.000000
## Data set: 117 x 1765 rating matrix of class 'realRatingMatrix' with 141164 ratings.
The evaluationScheme call above converts my user/item/rating triplets into recommenderlab’s realRatingMatrix and splits them into training and test portions.
Even with my heavy filtering, I’ve still got a fairly large matrix: 117 users by 1,765 movies with roughly 141K ratings.
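As a quick sanity check on the size and sparsity of the converted matrix, the dimensions and filled-in proportion can be inspected directly (a small sketch; r_check is just a throwaway name, and nratings() is recommenderlab’s count of stored ratings):
r_check <- as(my_ratings, "realRatingMatrix")
dim(r_check)                            # users x movies
nratings(r_check) / prod(dim(r_check))  # proportion of cells that contain a rating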
Before proceeding, I’ll take a quick look at the training data with recommenderlab’s image function - just a 50x50 corner, as the full matrix is too large to display properly.
image(getData(xval, "train")[c(1:50),c(1:50)],
main = "50x50 corner of the rating matrix")
The documentation within recommenderlab is very good, and its examples were referenced throughout to complete the assignment.
Further, the package’s evaluationScheme function makes training/testing splits and more sophisticated comparisons easy.
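For clarity, the split scheme exposes three pieces through getData(); as I understand the recommenderlab API, they are used as follows:
# getData(xval, "train")   - ratings of the training users, used to fit the models
# getData(xval, "known")   - the 'given' ratings (3 per test user) fed into predict()
# getData(xval, "unknown") - the withheld test ratings used to score the predictions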
Below, I build three recommenders: item-based collaborative filtering (IBCF), user-based collaborative filtering (UBCF), and a popularity baseline (POPULAR).
First, I create the Recommender object using method = "IBCF".
rm_ibcf <- Recommender(data = getData(xval, "train"),
method = "IBCF")
rm_ibcf
## Recommender of type 'IBCF' for 'realRatingMatrix'
## learned using 81 users.
rm_ubcf <- Recommender(data = getData(xval, "train"),
method = "UBCF")
rm_ubcf
## Recommender of type 'UBCF' for 'realRatingMatrix'
## learned using 81 users.
The third approach, the POPULAR method, recommends items by overall popularity rather than by user or item similarity:
rm_popu <- Recommender(data = getData(xval, "train"),
method = "POPULAR")
rm_popu
## Recommender of type 'POPULAR' for 'realRatingMatrix'
## learned using 81 users.
Below, I use the “known” portion of the evaluation data (the 3 given ratings for each test user) to generate predictions, and then evaluate the quality of each recommender against the “unknown” (held-out) ratings.
p1 <- predict(rm_ibcf, getData(xval, "known"), type = "ratings")
p2 <- predict(rm_ubcf, getData(xval, "known"), type = "ratings")
p3 <- predict(rm_popu, getData(xval, "known"), type = "ratings")
err_eval <- rbind(
IBCF = calcPredictionAccuracy(p1, getData(xval, "unknown")),
UBCF = calcPredictionAccuracy(p2, getData(xval, "unknown")),
POPU = calcPredictionAccuracy(p3, getData(xval, "unknown")))
err_eval
## RMSE MSE MAE
## IBCF 1.3248249 1.7551611 1.0541250
## UBCF 0.9944207 0.9888726 0.7774783
## POPU 0.9937167 0.9874729 0.7724321
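For reference, the error metrics reported by calcPredictionAccuracy follow the standard definitions over the predicted vs. held-out rating pairs. Here is a minimal sketch (not recommenderlab’s internal code), where pred and actual are hypothetical numeric vectors of equal length:
rating_errors <- function(pred, actual) {
err <- pred - actual
c(RMSE = sqrt(mean(err^2, na.rm = TRUE)),
MSE  = mean(err^2, na.rm = TRUE),
MAE  = mean(abs(err), na.rm = TRUE))
}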
From the error table above, I note that the POPULAR method performs best here, as it has the lowest RMSE (with UBCF a very close second and IBCF clearly behind).
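With a preferred model in hand, serving actual recommendations is straightforward. The sketch below picks the first test user and n = 5 arbitrarily, just to illustrate the call:
top5 <- predict(rm_popu, getData(xval, "known")[1, ], n = 5, type = "topNList")
as(top5, "list")  # movieIds of the top-5 recommended titles for that user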
Next, I’ll evaluate the POPULAR recommender model via 4-fold cross-validation using the evaluationScheme function.
sch <- evaluationScheme(as(my_ratings, "realRatingMatrix"),
method = "cross",
k = 4, # 4-fold cross validation scheme
given = 3,
goodRating=5)
# Next, use the created evaluation scheme to evaluate
# the POPULAR recommender on top-1, top-3, top-5,
# top-10, top-15, and top-20 recommendation lists.
# The output also reports how long the model takes to
# fit and to serve up the n recommendations:
results <- evaluate(sch,
method = "POPULAR",
type = "topNList",
n = c(1, 3, 5, 10, 15, 20))
## POPULAR run fold/sample [model time/prediction time]
## 1 [0.01sec/0.14sec]
## 2 [0.02sec/0.12sec]
## 3 [0.02sec/0.09sec]
## 4 [0.01sec/0.11sec]
results
## Evaluation results for 4 folds/samples using method 'POPULAR'.
getConfusionMatrix(results)[[1]]
## TP FP FN TN precision recall TPR
## 1 0.600000 0.400000 73.50000 1687.500 0.6000000 0.01380942 0.01380942
## 3 1.433333 1.566667 72.66667 1686.333 0.4777778 0.02926711 0.02926711
## 5 2.000000 3.000000 72.10000 1684.900 0.4000000 0.03717866 0.03717866
## 10 3.733333 6.266667 70.36667 1681.633 0.3733333 0.07053509 0.07053509
## 15 5.066667 9.933333 69.03333 1677.967 0.3377778 0.09751754 0.09751754
## 20 6.466667 13.533333 67.63333 1674.367 0.3233333 0.12106992 0.12106992
## FPR
## 1 0.0002336667
## 3 0.0009178757
## 5 0.0017591077
## 10 0.0036776987
## 15 0.0058376989
## 20 0.0079547292
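For reference, the precision/recall columns above follow the usual confusion-matrix definitions (stated here as comments, not package code):
# precision = TP / (TP + FP)  - share of recommended items that were relevant
# recall    = TP / (TP + FN)  - share of relevant items that were recommended
# TPR equals recall; FPR = FP / (FP + TN)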
plot(results, "prec/rec", annotate=TRUE)
# below, I follow along with the text to review
# how the models compare directly.
set.seed(12)
scheme <- evaluationScheme(as(my_ratings, "realRatingMatrix"),
method = "split",
train = .9,
k = 1,
given = -5, # all *but* 5 items per test user are given;
goodRating = 5) # the 5 withheld items are chosen at random.
scheme
## Evaluation scheme using all-but-5 items
## Method: 'split' with 1 run(s).
## Training set proportion: 0.900
## Good ratings: >=5.000000
## Data set: 117 x 1765 rating matrix of class 'realRatingMatrix' with 141164 ratings.
Next, I create a list of algorithms to try out:
algorithms <- list(
"random items" = list(name="RANDOM", param=NULL),
"popular items" = list(name="POPULAR", param=NULL),
"user-based CF" = list(name="UBCF", param=list(nn=50)),
"item-based CF" = list(name="IBCF", param=list(k=50)),
"SVD approximation" = list(name="SVD", param=list(k = 50))
)
results <- evaluate(scheme,
algorithms,
type = "topNList",
n=c(1, 3, 5, 10, 15, 20))
## RANDOM run fold/sample [model time/prediction time]
## 1 [0sec/0.03sec]
## POPULAR run fold/sample [model time/prediction time]
## 1 [0.02sec/0.06sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0.02sec/0.16sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [18.94sec/0.03sec]
## SVD run fold/sample [model time/prediction time]
## 1 [0.23sec/0.03sec]
results
## List of evaluation results for 5 recommenders:
## Evaluation results for 1 folds/samples using method 'RANDOM'.
## Evaluation results for 1 folds/samples using method 'POPULAR'.
## Evaluation results for 1 folds/samples using method 'UBCF'.
## Evaluation results for 1 folds/samples using method 'IBCF'.
## Evaluation results for 1 folds/samples using method 'SVD'.
names(results)
## [1] "random items" "popular items" "user-based CF"
## [4] "item-based CF" "SVD approximation"
results[["user-based CF"]]
## Evaluation results for 1 folds/samples using method 'UBCF'.
# True-Positive Rate (TPR) vs False-Positive Rate (FPR)
#par(mfrow=c(1,2))
plot(results, annotate=c(1,3), legend="topleft")
Above, I note that SVD approximation is pretty darn good.
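As a rough intuition for the SVD approach: the rating matrix is approximated by a rank-k factorization, and predictions are read off the low-rank reconstruction. The toy sketch below uses a small complete matrix; recommenderlab’s SVD recommender additionally has to handle the missing ratings (via imputation), which this illustration glosses over:
set.seed(1)
m <- matrix(rnorm(24), nrow = 4)  # toy complete "rating" matrix
s <- svd(m)
k <- 2                            # keep only the k largest singular values
m_approx <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])
round(m - m_approx, 2)            # residual error of the rank-k approximation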
plot(results, "prec/rec", annotate=3, legend="topleft")
But reviewing the precision/recall curves above, both metrics are quite low across the board, so none of these recommenders performs particularly well on the top-N task.
results <- evaluate(scheme, algorithms, type = "ratings")
## RANDOM run fold/sample [model time/prediction time]
## 1 [0sec/0.02sec]
## POPULAR run fold/sample [model time/prediction time]
## 1 [0.01sec/0sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/0.13sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [18sec/0.02sec]
## SVD run fold/sample [model time/prediction time]
## 1 [0.24sec/0.01sec]
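To compare the rating-prediction errors behind these runs numerically, the averaged RMSE/MSE/MAE can be pulled out per method with avg(); a sketch using the list names defined above:
avg(results[["popular items"]])
avg(results[["SVD approximation"]])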
Among the recommendation methods attempted, the best rating-prediction results came from the POPULAR method in the recommenderlab package. I look forward to exploring additional features and attributes of this data.
References:
Gorakala, S. and Usuelli, M., Building a Recommendation System with R (Chapter 3), Packt Publishing.
recommenderlab package documentation, https://cran.r-project.org/package=recommenderlab
MovieLens dataset, GroupLens Research, https://grouplens.org/