This project is based on the recommenderlab package as described in the Building a Recommendation System with R book by Gorakala and Usuelli. It presents basic item-item and user-user collaborative filtering.
# Required libraries
library(recommenderlab)
library(dplyr)
library(tidyr)
library(ggplot2)
The data set is courtesy of MovieLens project and it was downloaded from https://grouplens.org/datasets/movielens/.
# Data import
ratings <- read.csv(paste0("https://raw.githubusercontent.com/ilyakats/CUNY-DATA643/",
"master/Project%202/ml-latest-small/ratings.csv"))
titles <- read.csv(paste0("https://raw.githubusercontent.com/ilyakats/CUNY-DATA643/",
"master/Project%202/ml-latest-small/movies.csv"))
# Convert to matrix
movieMatrix <- ratings %>%
select(-timestamp) %>%
spread(movieId, rating)
row.names(movieMatrix) <- movieMatrix[,1]
movieMatrix <- movieMatrix[-c(1)]
movieMatrix <- as(as.matrix(movieMatrix), "realRatingMatrix")
movieMatrix
## 671 x 9066 rating matrix of class 'realRatingMatrix' with 100004 ratings.
Our movie matrix contains 671 users and 9,066 items/movies.
Let us take a quick look at the distribution of all ratings to make sure that there are no surprises. It seems that users favor whole number ratings. 4 is the most common rating which seems to be common for 5-star rating systems.
vRatings <- as.vector(movieMatrix@data)
vRatings <- vRatings[vRatings != 0]
ggplot() + aes(vRatings) +
geom_histogram(binwidth = 0.5) +
xlab("Rating") + ylab("No of Ratings")
It does not make sense to use sparse data to build our model, so we will select only users and items with the most information.
( movies <- movieMatrix[rowCounts(movieMatrix) > 50,
colCounts(movieMatrix) > 50] )
## 421 x 444 rating matrix of class 'realRatingMatrix' with 37915 ratings.
We are left with 421 users and 444 items. This is about two thirds of the original users and under 5% of the original items. However, this covers almost 38% of original ratings. It seems that there were a lot of items/movies with just a few ratings.
Any ratings matrix, especially movie ratings matrix, is bound to have some bias. Some users just give higher ratings than others. Consider average rating per user. We can see from the distribution plot below that it varies a lot.
avg <- rowMeans(movies)
ggplot() + aes(avg) +
geom_histogram(binwidth = 0.1) +
xlab("Average Rating") + ylab("No of Ratings")
recommenderlab normalizes the data when building a model. Let us normalize the ratings and confirm that all averages are 0 now to see what kind of effect it has.
moviesNorm <- normalize(movies)
avg <- round(rowMeans(moviesNorm),5)
table(avg)
## avg
## 0
## 421
minItems <- quantile(rowCounts(movies), 0.95)
minUsers <- quantile(colCounts(movies), 0.95)
image(movies[rowCounts(movies) > minItems,
colCounts(movies) > minUsers],
main = "Heatmap of the Top Users and Movies (Non-Normalized")
image(moviesNorm[rowCounts(moviesNorm) > minItems,
colCounts(moviesNorm) > minUsers],
main = "Heatmap of the Top Users and Movies (Normalized)")
Reviewing rows in two heatmaps above, we can see that after normalization, the average rating is more uniform. Visually it does appear that bias is reduced.
Let us split our set into training set (80%) and testing set (20%).
set.seed(88)
which_train <- sample(x = c(TRUE, FALSE), size = nrow(movies),
replace = TRUE, prob = c(0.8, 0.2))
movieTrain <- movies[which_train, ]
movieTest <- movies[!which_train, ]
Now let us create a model using the training set.
( model <- Recommender(movieTrain, method = "IBCF") )
## Recommender of type 'IBCF' for 'realRatingMatrix'
## learned using 338 users.
Interestingly, what can be the most time consuming and critical step - training the model - can be done in one line of code with recommenderlab package.
We can examine the similarity matrix and find top ten movies that are similar to other movies.
similarityMatrix <- getModel(model)$sim
which_max <- order(colSums(similarityMatrix > 0), decreasing = TRUE)[1:10]
topMovies <- as.data.frame(as.integer(rownames(similarityMatrix)[which_max]))
colnames(topMovies) <- c("movieId")
data <- topMovies %>% inner_join(titles, by = "movieId") %>% select(Movie = "title")
knitr::kable(data, format = "html") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
| Movie |
|---|
| Disclosure (1994) |
| Beverly Hills Cop III (1994) |
| City Slickers II: The Legend of Curly’s Gold (1994) |
| Grumpier Old Men (1995) |
| Eraser (1996) |
| Perfect Storm, The (2000) |
| Judge Dredd (1995) |
| Coneheads (1993) |
| Congo (1995) |
| Seven Samurai (Shichinin no samurai) (1954) |
( pred <- predict(model, newdata = movieTest, n = 6) )
## Recommendations as 'topNList' with n = 6 for 83 users.
Now we can extract recommendations. Consider the first user. Clearly a Star Wars fan and not a Trekkie.
# Movie ratings of the first user
user1 <- as.data.frame(movieTest@data[1,movieTest@data[1,]>0])
colnames(user1) <- c("Rating")
user1[c("movieId")] <- as.integer(rownames(user1))
data <- titles %>%
inner_join(user1, by = "movieId") %>%
select(Movie = "title", Rating) %>%
arrange(desc(Rating))
knitr::kable(data, format = "html") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
| Movie | Rating |
|---|---|
| Braveheart (1995) | 5 |
| Star Wars: Episode IV - A New Hope (1977) | 5 |
| Shawshank Redemption, The (1994) | 5 |
| Wallace & Gromit: A Close Shave (1995) | 5 |
| Wallace & Gromit: The Wrong Trousers (1993) | 5 |
| Star Wars: Episode V - The Empire Strikes Back (1980) | 5 |
| Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981) | 5 |
| Star Wars: Episode VI - Return of the Jedi (1983) | 5 |
| Amadeus (1984) | 5 |
| Glory (1989) | 5 |
| Babe (1995) | 4 |
| Birdcage, The (1996) | 4 |
| True Lies (1994) | 4 |
| Jurassic Park (1993) | 4 |
| Blade Runner (1982) | 4 |
| Nightmare Before Christmas, The (1993) | 4 |
| Aladdin (1992) | 4 |
| Dances with Wolves (1990) | 4 |
| Snow White and the Seven Dwarfs (1937) | 4 |
| 2001: A Space Odyssey (1968) | 4 |
| Fish Called Wanda, A (1988) | 4 |
| Monty Python’s Life of Brian (1979) | 4 |
| Monty Python and the Holy Grail (1975) | 4 |
| Blues Brothers, The (1980) | 4 |
| Terminator, The (1984) | 4 |
| This Is Spinal Tap (1984) | 4 |
| Field of Dreams (1989) | 4 |
| Star Trek II: The Wrath of Khan (1982) | 4 |
| Toy Story (1995) | 3 |
| GoldenEye (1995) | 3 |
| Get Shorty (1995) | 3 |
| Happy Gilmore (1996) | 3 |
| Star Trek: Generations (1994) | 3 |
| Forrest Gump (1994) | 3 |
| Four Weddings and a Funeral (1994) | 3 |
| Lion King, The (1994) | 3 |
| Mask, The (1994) | 3 |
| Speed (1994) | 3 |
| Mrs. Doubtfire (1993) | 3 |
| Sleepless in Seattle (1993) | 3 |
| Terminator 2: Judgment Day (1991) | 3 |
| Batman (1989) | 3 |
| Beauty and the Beast (1991) | 3 |
| Truth About Cats & Dogs, The (1996) | 3 |
| Independence Day (a.k.a. ID4) (1996) | 3 |
| Die Hard (1988) | 3 |
| Willy Wonka & the Chocolate Factory (1971) | 3 |
| E.T. the Extra-Terrestrial (1982) | 3 |
| Princess Bride, The (1987) | 3 |
| Back to the Future (1985) | 3 |
| Young Frankenstein (1974) | 3 |
| Indiana Jones and the Last Crusade (1989) | 3 |
| When Harry Met Sally… (1989) | 3 |
| Star Trek III: The Search for Spock (1984) | 3 |
| Star Trek IV: The Voyage Home (1986) | 3 |
| Raising Arizona (1987) | 3 |
| Stargate (1994) | 2 |
| Eraser (1996) | 2 |
| Twister (1996) | 1 |
His/her recommendations are as follows. Of course, this evaluation is highly subjective, but I would only question the recommendation of Casper.
# Recommendations for the first user
recommended <- pred@itemLabels[pred@items[[1]]]
recommended <- as.data.frame(as.integer(recommended))
colnames(recommended) <- c("movieId")
data <- recommended %>% inner_join(titles, by = "movieId") %>% select(Movie = "title")
knitr::kable(data, format = "html") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
| Movie |
|---|
| Ace Ventura: When Nature Calls (1995) |
| Casper (1995) |
| Congo (1995) |
| Judge Dredd (1995) |
| Clerks (1994) |
| Legends of the Fall (1994) |
The setup and code for the user based collaborative filtering is very similar to the item-based collaborative filtering above.
Now let us create a user-based model using the training set.
( model <- Recommender(movieTrain, method = "UBCF") )
## Recommender of type 'UBCF' for 'realRatingMatrix'
## learned using 338 users.
( pred <- predict(model, newdata = movieTest, n = 6) )
## Recommendations as 'topNList' with n = 6 for 83 users.
Again let us consider the first user and look at his/her recommendations. I would consider these as solid recommendations. The first user gravitated towards more critically acclaimed dramas and these recommendations are among the best movies produced.
# Recommendations for the first user
recommended <- pred@itemLabels[pred@items[[1]]]
recommended <- as.data.frame(as.integer(recommended))
colnames(recommended) <- c("movieId")
data <- recommended %>% inner_join(titles, by = "movieId") %>% select(Movie = "title")
knitr::kable(data, format = "html") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
| Movie |
|---|
| Citizen Kane (1941) |
| Amelie (Fabuleux destin d’Amélie Poulain, Le) (2001) |
| Psycho (1960) |
| Eternal Sunshine of the Spotless Mind (2004) |
| 12 Angry Men (1957) |
| Airplane! (1980) |
Consider if we build the model without normalizing the data. Only one movie is featured on both lists of recommendations for the first user - Eternal Sunshine of the Spotless Mind. I would argue that without normalization recommendations include more usual suspects (movies very highly rated by majority of users) and therefore are more generic. This is, of course, highly subjective and needs to be researched and tested using more objective means than this writer’s opinion.
model <- Recommender(movieTrain, method = "UBCF", parameter = list(normalize = NULL))
pred <- predict(model, newdata = movieTest, n = 6)
recommended <- pred@itemLabels[pred@items[[1]]]
recommended <- as.data.frame(as.integer(recommended))
colnames(recommended) <- c("movieId")
data <- recommended %>% inner_join(titles, by = "movieId") %>% select(Movie = "title")
knitr::kable(data, format = "html") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
| Movie |
|---|
| American Beauty (1999) |
| Pulp Fiction (1994) |
| Silence of the Lambs, The (1991) |
| Lord of the Rings: The Fellowship of the Ring, The (2001) |
| Eternal Sunshine of the Spotless Mind (2004) |
| Clockwork Orange, A (1971) |
This project presented the most basic approach to building a recommender system in R using the recommenderlab package. It describes some basic steps; however, it is important to note that this is a learning exercise. More development and testing would be needed for a usable recommender system. It would be particularly interesting to see the effect of additional features and to see the performance of the recommender when little information is known about user’s preference.