Introduction and Summary of Data Set
The following describes my path toward building a movie recommendation system that ultimately resulted in the Shiny app located here: Movie Recommendations on Shiny
This data set (ml-latest-small) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 100004 ratings and 1296 tag applications across 9125 movies. These data were created by 671 users between January 09, 1995 and October 16, 2016. This data set was generated on October 17, 2016.
Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.
The data are contained in the files links.csv, movies.csv, ratings.csv and tags.csv. More details about the contents and use of these files are given in the README distributed with the data set.
This is a development data set. As such, it may change over time and is not an appropriate data set for shared research results. See available benchmark data sets if that is your intent.
This and other GroupLens data sets are publicly available for download at http://grouplens.org/datasets/.
recommenderlab
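To work with the recommenderlab package, the data first need to be converted to a sparse format. A sparse matrix is a matrix in which most of the elements are zero, so only the non-zero entries are stored; here, an absent entry means "not rated". In our data set there are many users and many movies, but comparatively few ratings: only 100,004 of the roughly 6.1 million possible user/movie pairs (671 × 9,125), about 1.6%, carry a rating.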
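library(Matrix)          # provides sparseMatrix()
library(recommenderlab)  # provides realRatingMatrix, Recommender(), evaluationScheme()
# assumes `ratings` has already been read from ratings.csv, e.g. ratings = read.csv('ratings.csv')
# prefix the numeric IDs with a letter and store them as factors: the factor codes run from 1
# to the number of distinct IDs, so sparseMatrix() does not create empty rows and columns
# for ID numbers that never occur, and the dimensions stay correct
i = paste0('u', ratings$userId)
j = paste0('m', ratings$movieId)
x = ratings$rating
df = data.frame(i, j, x, stringsAsFactors = TRUE)
# as.integer() works here because i and j are factors: it returns their level codes
sparse_matrix = sparseMatrix(as.integer(df$i), as.integer(df$j), x = df$x)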
colnames(sparse_matrix) = levels(df$j)
rownames(sparse_matrix) = levels(df$i)
# create recommenderLab real rating object
real_ratings = new('realRatingMatrix', data = sparse_matrix)
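To see why the IDs are turned into factors before indexing, here is a minimal base-R sketch on made-up data (the three toy ratings are purely illustrative). It uses a dense matrix instead of sparseMatrix so that it runs without extra packages; the indexing logic is the same.

```r
# toy ratings with gaps in the ID ranges (hypothetical values)
toy = data.frame(userId = c(1, 5, 5), movieId = c(10, 10, 300), rating = c(4, 3, 5))

# factor codes are consecutive integers, one per distinct ID
u = factor(paste0('u', toy$userId))
m = factor(paste0('m', toy$movieId))

# dense stand-in for the sparse matrix: one row per user, one column per movie
dense = matrix(0, nrow = nlevels(u), ncol = nlevels(m),
               dimnames = list(levels(u), levels(m)))
dense[cbind(as.integer(u), as.integer(m))] = toy$rating

dim(dense)                             # dimensions when indexing by factor codes
c(max(toy$userId), max(toy$movieId))   # dimensions the raw IDs would have required
```

Indexing by factor codes gives a 2 x 2 matrix, whereas using the raw IDs as indices would have produced a 5 x 300 matrix that is almost entirely empty.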
Most Popular
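For a user who has not yet rated an item, the most-popular approach predicts a rating from the average rating given to that item by the users who have rated it. Without normalization, every user who has not rated the item would therefore receive the same prediction. Because the model below centers each user's ratings (normalize = 'center'), the item average is computed on the centered ratings and each user's own mean is added back, so predictions for an item differ between users only by their mean rating.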
# create Recommender object for popular model
model_popular = Recommender(real_ratings, method = 'POPULAR', param = list(normalize = 'center'))
# create prediction object
pred_popular = predict(model_popular, real_ratings[1:5], type = 'ratings')
as(pred_popular, 'matrix')[, 1:5]
m1 m10 m100 m100017 m100032
u1 2.775976 2.448398 2.345312 1.329026 1.928235
u10 3.921628 3.594050 3.490964 2.474678 3.073887
u100 NA 3.298398 3.195312 2.179026 2.778235
u101 4.125976 3.798398 3.695312 2.679026 3.278235
u102 4.200902 3.873324 3.770238 2.753952 3.353162
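In the output above, the gap between any two columns is the same for every user (m1 minus m10 is about 0.33 for u1, u10 and u101 alike). The following base-R sketch shows one way such a pattern arises: an item offset computed on user-centered ratings, plus each user's own mean. The toy matrix is made up, and this is only an illustration of the idea, not recommenderlab's actual implementation.

```r
# toy rating matrix: rows are users, columns are movies, NA = not rated (made-up values)
r = matrix(c(5,  4, NA,
             2, NA,  1,
             4,  3, NA),
           nrow = 3, byrow = TRUE,
           dimnames = list(c('u1', 'u2', 'u3'), c('m1', 'm2', 'm3')))

user_mean = rowMeans(r, na.rm = TRUE)            # each user's average rating
centered = r - user_mean                         # subtract the user's mean from each of their ratings
item_offset = colMeans(centered, na.rm = TRUE)   # average centered rating per movie

# prediction = movie offset + the user's own mean
pred = outer(user_mean, item_offset, '+')
pred[!is.na(r)] = NA                             # keep predictions for unrated movies only
pred
```

Every user receives the same offset for a given movie, so two users' predictions for that movie differ exactly by the difference of their mean ratings.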
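# evaluate accuracy of popular model
# given = -5 withholds 5 ratings per test user for scoring (all-but-5)
e_popular = evaluationScheme(real_ratings, method = 'split', train = 0.8, given = -5)
# fit on the training users only and predict with that model, so the test users stay unseen
# (the figures printed below came from a run that predicted with the full-data model_popular)
model_popular_eval = Recommender(getData(e_popular, 'train'), method = 'POPULAR', param = list(normalize = 'center'))
pred_popular = predict(model_popular_eval, getData(e_popular, 'known'), type = 'ratings')
rmse_popular = calcPredictionAccuracy(pred_popular, getData(e_popular, 'unknown'))
rmse_popular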
RMSE MSE MAE
0.9248621 0.8553699 0.6941487
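The three figures reported by calcPredictionAccuracy are standard error measures. As a quick reminder of what they mean, here is a small base-R sketch that computes them by hand on made-up actual and predicted ratings (the vectors are hypothetical, not taken from the data set).

```r
# made-up withheld ratings and the corresponding predictions
actual    = c(4.0, 3.0, 5.0, 2.0)
predicted = c(3.5, 3.0, 4.0, 3.0)

err  = predicted - actual
mse  = mean(err^2)       # mean squared error
rmse = sqrt(mse)         # root mean squared error, in rating units
mae  = mean(abs(err))    # mean absolute error

c(RMSE = rmse, MSE = mse, MAE = mae)
```

RMSE is the square root of MSE, which is why the first two columns of the output above always move together; MAE penalizes large misses less heavily than RMSE does.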
User-Based Collaborative Filtering (UBCF)
Again, for a user who has not yet rated an item, UBCF starts by identifying the users most similar to that user. It then predicts the rating as the average of the ratings these similar users gave the item, optionally weighted by how similar each of them is to the target user.
Various similarity measures can be used, e.g. Pearson correlation or cosine similarity. The number of similar users n can be tuned via cross-validation or similar approaches.
Because UBCF depends on user similarity, it must recalculate item ratings whenever the user's preferences change. As a result, it can be too slow to make real-time item recommendations when the number of users and items is substantial.
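The steps described above can be sketched in a few lines of base R. The toy matrix, the helper name cosine_sim and the choice of k = 2 neighbours are all assumptions made for illustration; recommenderlab's UBCF has its own defaults and normalization, so this is not its implementation.

```r
# toy rating matrix: rows are users, columns are movies, NA = not rated (made-up values)
r = matrix(c(5, 4, NA,  1,
             4, 5,  2,  1,
             1, 2,  5,  4,
             5, 5,  1, NA),
           nrow = 4, byrow = TRUE,
           dimnames = list(paste0('u', 1:4), paste0('m', 1:4)))

# cosine similarity over the entries both vectors have rated (hypothetical helper)
cosine_sim = function(a, b) {
  both = !is.na(a) & !is.na(b)
  sum(a[both] * b[both]) / sqrt(sum(a[both]^2) * sum(b[both]^2))
}

target = 'u1'   # user we predict for
item = 'm3'     # movie that user has not rated
k = 2           # number of neighbours to use

# only users who have rated the movie can contribute
candidates = setdiff(rownames(r)[!is.na(r[, item])], target)
sims = sapply(candidates, function(u) cosine_sim(r[target, ], r[u, ]))
nearest = names(sort(sims, decreasing = TRUE))[seq_len(min(k, length(sims)))]

# similarity-weighted average of the neighbours' ratings
pred = sum(sims[nearest] * r[nearest, item]) / sum(sims[nearest])
pred
```

Here u4 and u2 rate most like u1, and both gave m3 a low rating, so the prediction for u1 lands between their ratings of 1 and 2. Note that the similarities are computed at prediction time, which is the cost the paragraph above refers to.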
# create Recommender object for ubcf model
model_ubcf = Recommender(real_ratings, method = 'UBCF', param = list(normalize = 'center'))
# create prediction object
pred_ubcf = predict(model_ubcf, real_ratings[1:5], type = 'ratings')
as(pred_ubcf, 'matrix')[, 1:5]
m1 m10 m100 m100017 m100032
u1 2.559320 2.544006 2.550000 2.550000 2.550000
u10 3.787169 3.626540 3.653646 3.695652 3.695652
u100 NA 3.350874 3.354726 3.400000 3.400000
u101 4.045674 3.905234 3.900000 3.900000 3.900000
u102 4.073754 3.863099 3.958329 3.974926 3.974926
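# evaluate accuracy of ubcf model
e_ubcf = evaluationScheme(real_ratings, method = 'split', train = 0.8, given = -5)
# fit on the training users only and predict with that model, so the test users stay unseen
# (the figures printed below came from a run that predicted with the full-data model_ubcf)
model_ubcf_eval = Recommender(getData(e_ubcf, 'train'), method = 'UBCF', param = list(normalize = 'center'))
pred_ubcf = predict(model_ubcf_eval, getData(e_ubcf, 'known'), type = 'ratings')
rmse_ubcf = calcPredictionAccuracy(pred_ubcf, getData(e_ubcf, 'unknown'))
rmse_ubcf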
RMSE MSE MAE
0.7017058 0.4923910 0.5312428
Item-Based Collaborative Filtering (IBCF)
Similar to UBCF, IBCF is also based on similarity. But in the case of IBCF, the similarity in question is the similarity between items. IBCF calculates a similarity measure for all items based on existing user ratings. Then as a user peruses a new item, the algorithm can recommend similar items.
As with UBCF, IBCF can use the same kinds of similarity measures, and the number of similar items n can be tuned with the same optimization techniques.
Unlike UBCF, however, the resource-intensive process of calculating similarity between items can be done offline. This makes IBCF much more responsive when recommending items to users, so it can be used in a real-time setting. Amazon uses a custom IBCF approach.
With the full data set, the process takes hours to run, so I’ve not included the output here.
# create Recommender object for ibcf model
#model_ibcf = Recommender(real_ratings, method = 'IBCF', param = list(normalize = 'center'))
# create prediction object
#pred_ibcf = predict(model_ibcf, real_ratings[1:5], type = 'ratings')
#as(pred_ibcf, 'matrix')[, 1:5]
# evaluate accuracy of ibcf model
#e_ibcf = evaluationScheme(real_ratings, method = 'split', train = 0.8, given = -5)
# fit on the training users only and predict with that model
#model_ibcf_eval = Recommender(getData(e_ibcf, 'train'), method = 'IBCF', param = list(normalize = 'center'))
#pred_ibcf = predict(model_ibcf_eval, getData(e_ibcf, 'known'), type = 'ratings')
#rmse_ibcf = calcPredictionAccuracy(pred_ibcf, getData(e_ibcf, 'unknown'))
#rmse_ibcf
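Since the IBCF output is omitted, the idea can at least be illustrated on a toy matrix. In this base-R sketch the item-to-item similarity matrix is computed once up front (the "offline" step) and a prediction then only needs a lookup. The data, the helper cosine_sim and k = 2 are assumptions for illustration, not recommenderlab's implementation.

```r
# toy rating matrix: rows are users, columns are movies, NA = not rated (made-up values)
r = matrix(c(5, 4, NA,  1,
             4, 5,  2,  1,
             1, 2,  5,  4,
             5, 5,  1, NA),
           nrow = 4, byrow = TRUE,
           dimnames = list(paste0('u', 1:4), paste0('m', 1:4)))

# cosine similarity over the entries both vectors have rated (hypothetical helper)
cosine_sim = function(a, b) {
  both = !is.na(a) & !is.na(b)
  sum(a[both] * b[both]) / sqrt(sum(a[both]^2) * sum(b[both]^2))
}

# offline step: similarity between every pair of movies (columns)
items = colnames(r)
item_sims = sapply(items, function(a) sapply(items, function(b) cosine_sim(r[, a], r[, b])))

# online step: predict u1's rating of m3 from the k most similar movies u1 has rated
user = 'u1'
item = 'm3'
k = 2
rated = items[!is.na(r[user, ])]
top = sort(item_sims[item, rated], decreasing = TRUE)[seq_len(min(k, length(rated)))]
pred = sum(top * r[user, names(top)]) / sum(top)
pred
```

The similarity matrix grows with the square of the number of items, which is consistent with the long run time on the full data set, but it does not need to be rebuilt each time a user asks for a recommendation.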
Singular Value Decomposition (SVD)
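Singular value decomposition is essentially trying to reduce a rank R matrix to a rank K matrix by taking a list of R unique vectors and approximating them as a linear combination of K unique vectors (Quora). For a recommender, the rating matrix is approximated by such a low-rank matrix, and the entries of the approximation serve as predicted ratings.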
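The rank reduction described above can be tried directly with base R's svd(). This sketch uses a small, fully observed, made-up matrix and keeps K = 2 singular values; how recommenderlab's SVD method deals with missing ratings is not covered here.

```r
# small, fully observed toy matrix (made-up values)
a = matrix(c(5, 4, 1, 1,
             4, 5, 2, 1,
             1, 2, 5, 4,
             5, 5, 1, 2),
           nrow = 4, byrow = TRUE)

s = svd(a)   # a = u %*% diag(d) %*% t(v)
k = 2        # number of singular values to keep

# rank-k approximation built from the k largest singular values
a_k = s$u[, 1:k, drop = FALSE] %*% diag(s$d[1:k], nrow = k) %*% t(s$v[, 1:k, drop = FALSE])

# the approximation error equals the size of the discarded singular values
err = sqrt(sum((a - a_k)^2))
c(error = err, discarded = sqrt(sum(s$d[-(1:k)]^2)))
round(a_k, 2)
```

The two numbers printed first agree, which reflects the fact that truncating the SVD gives the best rank-K approximation in the least-squares sense.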
# create Recommender object for svd model
model_svd = Recommender(real_ratings, method = 'SVD', param = list(normalize = 'center'))
# create prediction object
pred_svd = predict(model_svd, real_ratings[1:5], type = 'ratings')
as(pred_svd, 'matrix')[, 1:5]
m1 m10 m100 m100017 m100032
u1 2.543127 2.550196 2.550261 2.549503 2.549747
u10 3.726763 3.683090 3.689763 3.693689 3.694652
u100 NA 3.377720 3.394836 3.400426 3.400217
u101 4.045254 3.860013 3.884195 3.898862 3.899420
u102 4.418523 3.667622 3.850810 3.955026 3.964792
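# evaluate accuracy of svd model
e_svd = evaluationScheme(real_ratings, method = 'split', train = 0.8, given = -5)
# fit on the training users only and predict with that model, so the test users stay unseen
# (the figures printed below came from a run that predicted with the full-data model_svd)
model_svd_eval = Recommender(getData(e_svd, 'train'), method = 'SVD', param = list(normalize = 'center'))
pred_svd = predict(model_svd_eval, getData(e_svd, 'known'), type = 'ratings')
rmse_svd = calcPredictionAccuracy(pred_svd, getData(e_svd, 'unknown'))
rmse_svd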
RMSE MSE MAE
0.9047596 0.8185900 0.7085536
Algorithm Comparison
# combine rmse outputs
library(tidyverse)  # provides %>%, gather() and ggplot()
comparison = rbind(rmse_popular, rmse_ubcf, rmse_svd)
comparison = data.frame(comparison, row.names = NULL)
comparison = cbind(model = c('popular', 'ubcf', 'svd'), comparison)
comparison %>% gather('measure', 'value', -1) %>%
  ggplot(aes(x = measure, y = value, fill = model)) +
  geom_bar(stat = 'identity', position = position_dodge())

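In this particular instance, the UBCF model produces smaller errors (RMSE 0.70) than the popular (0.92) and SVD (0.90) models on the held-out ratings. Each model was scored on its own random split, so the comparison is indicative rather than exact. Since the data set is small enough for UBCF to remain practical, we will use UBCF to build our recommendation engine below.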
Recommend Movies
Static Movie Recommendations
In the example below, I’ve added some movies and ratings and generated some movie recommendations using UBCF.
# find movies based on genre and year
#movies %>% filter(str_detect(genres, "Animation") & year == 2014)
# create custom user ratings
custom_ratings_df = data.frame(title = c('The Secret Life of Pets (2016)', 'Kung Fu Panda 3 (2016)', 'Zootopia (2016)', 'Inside Out (2015)',
'Minions (2015)', 'The Good Dinosaur (2015)', 'Hotel Transylvania 2 (2015)', 'The Lego Movie (2014)',
'Mr. Peabody & Sherman (2014)', 'How to Train Your Dragon 2 (2014)', 'Big Hero 6 (2014)',
'Song of the Sea (2014)', 'Paperman (2012)', 'Grand Budapest Hotel, The (2014)',
"King's Speech, The (2010)", 'How to Train Your Dragon (2010)', 'Avengers, The (2012)',
'The Imitation Game (2014)'),
rating = c(3.5, 3.5, 5.0, 4.0, 3.0, 1.0, 5.0, 1.0, 1.0, 4.5, 5.0, 5.0, 5.0, 5.0, 5.0, 4.5, 1.0, 4.0))
# rating = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1))
# add movieId
custom_ratings = custom_ratings_df %>% left_join(movies, by = 'title') %>%
mutate(i = 'uCustom', j = paste0('m', movieId), x = rating) %>% select(i, j, x)
joining character vector and factor, coercing into character vector
#custom_ratings
custom_df = rbind(df, custom_ratings)
custom_sparse_matrix = sparseMatrix(as.integer(custom_df$i), as.integer(custom_df$j), x = custom_df$x)
colnames(custom_sparse_matrix) = levels(custom_df$j)
rownames(custom_sparse_matrix) = levels(custom_df$i)
# check custom user ratings
check = data.frame(custom_sparse_matrix[custom_sparse_matrix@Dimnames[[1]] == 'uCustom',])
check$movieId = rownames(check)
colnames(check)[1] = 'rating'
#check[check$rating != 0,]
# create real rating object
custom_real_ratings = new('realRatingMatrix', data = custom_sparse_matrix)
# make prediction using ubcf model
custom_ubcf = predict(model_ubcf, custom_real_ratings, n = 20)
custom_ubcf = as(custom_ubcf, 'list')$uCustom
# rank follows the number of returned movies (up to n = 20) rather than a hard-coded 1:10
custom_ubcf = data.frame(rank = seq_along(custom_ubcf), movieId = as.integer(str_replace(custom_ubcf, 'm', '')))
custom_ubcf %>% left_join(movies, by = 'movieId')
Shiny App for Movie Recommendations
The method above of entering movies and ratings is a bit cumbersome, so I created a Shiny implementation of the code above to make the process more seamless.
It can be found here: Movie Recommendations on Shiny
