The goal of this assignment is for you to try out different ways of implementing and configuring a recommender, and to evaluate your different approaches.
For assignment 2, start with an existing dataset of user-item ratings, such as our toy books dataset, MovieLens, Jester [http://eigentaste.berkeley.edu/dataset/] or another dataset of your choosing. Implement at least two of these recommendation algorithms:
• Content-Based Filtering
• User-User Collaborative Filtering
• Item-Item Collaborative Filtering
As an example of implementing a Content-Based recommender, you could build item profiles for a subset of MovieLens movies from scraping http://www.imdb.com/ or using the API at https://www.omdbapi.com/ (which has very recently instituted a small monthly fee). A more challenging method would be to pull movie summaries or reviews and apply tf-idf and/or topic modeling.
You should evaluate and compare different approaches, using different algorithms, normalization techniques, similarity methods, neighborhood sizes, etc. You don’t need to be exhaustive—these are just some suggested possibilities.
You may use the course text’s recommenderlab or any other library that you want.
Please provide at least one graph, and a textual summary of your findings and recommendations.
We will use the MovieLense data set that ships with recommenderlab. It is the MovieLens 100k data set, which is also available at https://grouplens.org/datasets/movielens.
Let’s load the data and apply the necessary transformations.
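library(recommenderlab)
library(knitr)       # kable()
library(kableExtra)  # kable_styling(); also provides %>%
data(MovieLense)
movieMatrix <- MovieLense
dim(movieMatrix)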
## [1] 943 1664
# We could also have loaded the data from the MovieLens files found at https://grouplens.org/datasets/movielens (this needs dplyr and tidyr)
# ratings = read.csv("https://raw.githubusercontent.com/theoracley/Data612/master/Project2/ratings.csv")
#
# #wide format transformation
# data<-ratings%>% select (movieId, userId, rating) %>% spread (movieId,rating)
#
# #converting our data into a rating Matrix
# movieMatrix <- as(as.matrix(data[-c(1)]), "realRatingMatrix")
Our dataset contains 943 users and 1664 movies.
First, let’s look at some sample data from the data set.
# inspect the structure of our realRatingMatrix
str(movieMatrix)
## Formal class 'realRatingMatrix' [package "recommenderlab"] with 2 slots
## ..@ data :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. ..@ i : int [1:99392] 0 1 4 5 9 12 14 15 16 17 ...
## .. .. ..@ p : int [1:1665] 0 452 583 673 882 968 994 1386 1605 1904 ...
## .. .. ..@ Dim : int [1:2] 943 1664
## .. .. ..@ Dimnames:List of 2
## .. .. .. ..$ : chr [1:943] "1" "2" "3" "4" ...
## .. .. .. ..$ : chr [1:1664] "Toy Story (1995)" "GoldenEye (1995)" "Four Rooms (1995)" "Get Shorty (1995)" ...
## .. .. ..@ x : num [1:99392] 5 4 4 4 4 3 1 5 4 5 ...
## .. .. ..@ factors : list()
## ..@ normalize: NULL
head(movieMatrix@data [1:5,1:5])
## 5 x 5 sparse Matrix of class "dgCMatrix"
## Toy Story (1995) GoldenEye (1995) Four Rooms (1995) Get Shorty (1995)
## 1 5 3 4 3
## 2 4 . . .
## 3 . . . .
## 4 . . . .
## 5 4 3 . .
## Copycat (1995)
## 1 3
## 2 .
## 3 .
## 4 .
## 5 .
# check out the ratings from user 1 and user 100 (0 means not rated)
movieMatrix@data[1,1:5]
## Toy Story (1995) GoldenEye (1995) Four Rooms (1995) Get Shorty (1995)
## 5 3 4 3
## Copycat (1995)
## 3
movieMatrix@data[100, 1:5]
## Toy Story (1995) GoldenEye (1995) Four Rooms (1995) Get Shorty (1995)
## 0 0 0 0
## Copycat (1995)
## 0
# Let's pick user 900 and see their ratings for the 10 movies in columns 1500 to 1509
movieMatrix@data[900,1500:1509]
## Mad Dog Time (1996) Children of the Revolution (1996)
## 0 0
## World of Apu, The (Apur Sansar) (1959) Sprung (1997)
## 0 0
## Dream With the Fishes (1997) Wings of Courage (1995)
## 0 0
## Wedding Gift, The (1994) Race the Sun (1996)
## 0 0
## Losing Isaiah (1995) New Jersey Drive (1995)
## 0 0
Let’s check out the distribution of ratings.
# number of movies that user 100 has actually rated
length(movieMatrix@data[100,][movieMatrix@data[100,] > 0])
## [1] 56
# Total number of ratings provided by the users
nratings(movieMatrix)
## [1] 99392
# we could also have used this to count the ratings above 0, but it is not optimized
#length(movieMatrix@data[movieMatrix@data[,]>0])
# Check out the rating distribution
hist(getRatings(movieMatrix), main = "Distribution Of Ratings", xlim=c(0,5), breaks="FD")
User 100 rated only 56 of the 1664 movies, so most of that user's row is empty. According to the histogram, which covers all 99392 ratings, the most common rating is 4.
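To put those counts in perspective, the short sketch below turns them into percentages. It only uses the figures already printed above (943 users, 1664 movies, 99392 ratings, 56 ratings from user 100); the variable names are made up for illustration.

```r
# Figures reported above for MovieLense
n_users   <- 943
n_movies  <- 1664
n_ratings <- 99392

# Share of all user-movie cells that actually hold a rating
density <- n_ratings / (n_users * n_movies)
round(100 * density, 1)        # 6.3 (percent)

# Share of the catalogue rated by user 100 (56 movies)
user100_share <- 56 / n_movies
round(100 * user100_share, 1)  # 3.4 (percent)
```

In other words, roughly 94% of the rating matrix is empty, which is why it is stored as a sparse matrix.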
Let’s check out the heatmap
# heatmap of the rating matrix
image(movieMatrix, main = "Heatmap of the rating matrix")
To get the best model, we should select the most relevant data. Movies that have been rated only a few times, and users who have rated only a few movies, give us too little data to work with, so their ratings could be biased. Normally we would also transform the ratings from their original, unnormalized form into a normalized one. However, the recommenderlab package normalizes the data for us by default, so there is no need to do it ourselves. We will also split our dataset into training and test sets, as usual for modeling purposes.
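As a rough illustration of what centering does, here is a minimal base R sketch on a made-up 3 x 3 ratings matrix. It is not recommenderlab's internal code; it only shows the idea of subtracting each user's own mean rating, assuming `NA` marks a missing rating.

```r
# Toy ratings (hypothetical data): rows are users, columns are movies,
# NA = not rated
ratings <- matrix(
  c( 5,  4, NA,
     2, NA,  1,
    NA,  3,  5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("u1", "u2", "u3"), c("m1", "m2", "m3"))
)

# Each user's mean over the movies they actually rated
user_means <- rowMeans(ratings, na.rm = TRUE)

# Center: subtract the user's own mean from each of their ratings
centered <- sweep(ratings, 1, user_means, FUN = "-")
centered
```

After centering, every user's ratings average to zero, so a generous rater and a harsh rater become comparable before similarities are computed.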
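# selecting the most relevant data: users with more than 50 ratings and movies with more than 100 ratings
ratingsMovies <- MovieLense[rowCounts(MovieLense) > 50, colCounts(MovieLense) > 100]
# splitting the dataset into roughly 80% training and 20% test users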
training <- sample(x = c(TRUE, FALSE), size = nrow(ratingsMovies), replace = TRUE, prob = c(0.8, 0.2))
train <- ratingsMovies[training, ]
test <- ratingsMovies[!training, ]
Item-based collaborative filtering (IBCF) works as follows:
• Users will prefer products similar to the ones they have already rated.
• It is based on measuring the similarity between items.
• It explores the relationship between items rather than between users.
• For each item, only the top n most similar items are stored (rather than all items, for efficiency), based on a similarity measure such as cosine or Pearson. A weighted sum is then used to make the final recommendation for a user.
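The steps above can be sketched in base R on a tiny made-up matrix. This is only an illustration under stated assumptions, not how recommenderlab implements IBCF: unrated cells are stored as 0, cosine similarity is computed on those zero-filled columns, `k = 2` is an arbitrary neighbourhood size, and `predict_user()` is a hypothetical helper.

```r
# Toy ratings (hypothetical data): rows are users, columns are items,
# 0 = not rated
ratings <- matrix(
  c(5, 4, 0, 1,
    4, 5, 1, 0,
    1, 0, 5, 4,
    0, 1, 4, 5),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("u", 1:4), paste0("i", 1:4))
)

# Cosine similarity between every pair of item columns
norms <- sqrt(colSums(ratings^2))
item_sim <- crossprod(ratings) / outer(norms, norms)
diag(item_sim) <- 0  # an item should not be its own neighbour

# Keep only the k most similar items for each item (column)
k <- 2
top_k <- apply(item_sim, 2, function(s) {
  s[rank(-s, ties.method = "first") > k] <- 0
  s
})

# Weighted-sum prediction for one user (hypothetical helper):
# each rated item votes for its neighbours, weighted by similarity
predict_user <- function(user_ratings, sim) {
  rated <- user_ratings > 0
  num <- as.vector(user_ratings[rated] %*% sim[rated, , drop = FALSE])
  den <- colSums(abs(sim[rated, , drop = FALSE]))
  pred <- num / den
  pred[rated] <- NA  # only predict items the user has not rated
  names(pred) <- colnames(sim)
  pred
}

predict_user(ratings["u1", ], top_k)
```

For user u1 the only unrated item is i3. Its two stored neighbours are i4 (which u1 rated 1) and i1 (which u1 rated 5), and because i4 is far more similar to i3, the prediction lands near the low end at 85/49, about 1.73.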
We will first look at the similarity between items, then build our item-based model and use it to recommend movies to users.
# compute the item similarity matrix
similarity_items <- similarity(MovieLense[, 1:4], method = "cosine", which = "items")
# visualize the item similarity matrix
image(as.matrix(similarity_items), main = "Items similarities")
The yellow diagonal means that each item is perfectly similar to itself. The redder a cell is, the more similar the two items are to each other.
Let’s predict movies for the users in the test set.
# build recommender model
recommender_m1 <- Recommender(train, method = "IBCF", parameter = list(k = 30))
model_details1 <- getModel(recommender_m1)
# prediction
recommender_predict1 <- predict(object = recommender_m1, newdata = test, n = 6)
recommender_matrix1 <- sapply(recommender_predict1@items, function(x) {colnames(ratingsMovies)[x]})
recommender_matrix1[, 1:3] %>% kable() %>% kable_styling(full_width = T)
5 | 6 | 12 |
---|---|---|
Maverick (1994) | Devil’s Own, The (1997) | Crimson Tide (1995) |
Net, The (1995) | Alien: Resurrection (1997) | Ed Wood (1994) |
Devil’s Advocate, The (1997) | Nightmare on Elm Street, A (1984) | Hoop Dreams (1994) |
What’s Eating Gilbert Grape (1993) | Peacemaker, The (1997) | Professional, The (1994) |
Cape Fear (1991) | Happy Gilmore (1996) | Quiz Show (1994) |
In the Line of Fire (1993) | Cinderella (1950) | Searching for Bobby Fischer (1993) |
User-based collaborative filtering (UBCF) works as follows:
• Similar users will have similar movie tastes.
• It is based on measuring the similarity between users.
• It is a memory-based model, as it loads the whole rating matrix into memory.
• It is a two-step process. First, we find the given user’s neighbours, using a similarity measure such as the Pearson coefficient or cosine similarity. Then, for each item the user has not rated, we use the average rating that the neighbours gave that item.
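The two steps can be sketched in base R on a small made-up matrix. This is a simplified illustration, not recommenderlab's implementation: it assumes `NA` marks a missing rating, uses a plain (unweighted) average of the neighbours' ratings, and picks `nn = 2` arbitrarily as the neighbourhood size.

```r
# Toy ratings (hypothetical data): rows are users, columns are items,
# NA = not rated
ratings <- matrix(
  c(5, 4, 1, NA,
    4, 5, 2,  1,
    5, 5, 1,  2,
    1, 2, 5,  5),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("u", 1:4), paste0("i", 1:4))
)

target <- "u1"
nn <- 2  # neighbourhood size, an arbitrary choice for this toy example

# Step 1: Pearson correlation between the target and every other user,
# computed over the items both have rated
others <- setdiff(rownames(ratings), target)
sims <- sapply(others, function(u) {
  cor(ratings[target, ], ratings[u, ], use = "pairwise.complete.obs")
})
neighbours <- names(sort(sims, decreasing = TRUE))[seq_len(nn)]

# Step 2: average the neighbours' ratings for items the target has not rated
unrated <- names(which(is.na(ratings[target, ])))
predicted <- colMeans(ratings[neighbours, unrated, drop = FALSE], na.rm = TRUE)
predicted
```

Here u3 and u2 turn out to be the closest neighbours of u1, while u4 has the opposite taste (correlation of -1) and is left out. The predicted rating for i4 is the mean of the neighbours' ratings, (2 + 1) / 2 = 1.5.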
# compute the user similarity matrix
similarity_users <- similarity(MovieLense[1:4, ], method = "pearson", which = "users")
# visualize the user similarity matrix
image(as.matrix(similarity_users), main = "Users similarities")
The yellow diagonal means that each user is perfectly similar to themselves. The darker a cell is, the more similar the two users' tastes are.
Let’s build our user-based recommender and predict with it.
# UBCF takes its neighbourhood size as nn (default 25); k is an IBCF parameter and is ignored by UBCF
recommender_m2 <- Recommender(train, method = "UBCF", parameter = list(nn = 25))
model_details2 <- getModel(recommender_m2)
recommender_predict2 <- predict(object = recommender_m2, newdata = test, n = 6)
recommender_matrix2 <- sapply(recommender_predict2@items, function(x) {colnames(ratingsMovies)[x]})
recommender_matrix2[, 1:3] %>% kable() %>% kable_styling(full_width = T)
5 | 6 | 12 |
---|---|---|
Gone with the Wind (1939) | Hoop Dreams (1994) | Blues Brothers, The (1980) |
Godfather: Part II, The (1974) | Happy Gilmore (1996) | Shawshank Redemption, The (1994) |
My Left Foot (1989) | Strictly Ballroom (1992) | Blade Runner (1982) |
Lawrence of Arabia (1962) | Adventures of Priscilla, Queen of the Desert, The (1994) | Reservoir Dogs (1992) |
Shawshank Redemption, The (1994) | Amistad (1997) | Alien (1979) |
Terminator 2: Judgment Day (1991) | Rear Window (1954) | Babe (1995) |