Collaborative Filtering

Definition

Collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users.

Collaborative filtering, source : wikipedia

Data Input

The MovieLens Latest Datasets are available at https://grouplens.org/datasets/movielens/latest/

File name: ml-latest-small.zip
Direct download URL: http://files.grouplens.org/datasets/movielens/ml-latest-small.zip

Download the zip file above, unzip it, and read the movies.csv and ratings.csv files.
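The download and unzip steps can also be scripted. The following is a minimal sketch using only base R, assuming the direct download URL above is still reachable; fetch_movielens is a hypothetical helper name, not part of any package.

```r
# Hypothetical helper: download the archive (if missing) and return the extracted file paths.
fetch_movielens <- function(dest_dir = ".",
                            url = "http://files.grouplens.org/datasets/movielens/ml-latest-small.zip") {
  zip_path <- file.path(dest_dir, basename(url))
  if (!file.exists(zip_path)) {
    download.file(url, destfile = zip_path, mode = "wb")  # binary mode for zip files
  }
  unzip(zip_path, exdir = dest_dir)  # returns the paths of the extracted files
}

# Usage (needs network access, so it is left commented out):
# files   <- fetch_movielens()
# movies  <- read.csv(grep("movies\\.csv$", files, value = TRUE), stringsAsFactors = FALSE)
# ratings <- read.csv(grep("ratings\\.csv$", files, value = TRUE))
```

Looking the CSV files up in the returned paths avoids hard-coding the folder name inside the archive.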

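# Packages used below (dcast is assumed to come from reshape2)
library(data.table)     # tstrsplit
library(reshape2)       # dcast
library(recommenderlab) # realRatingMatrix, Recommender
library(ggplot2)        # qplot
# Reading files
movies <- read.csv("movies.csv",stringsAsFactors=FALSE)
ratings <- read.csv("ratings.csv")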

Printing first three rows of movies object

head(movies,3)
##   movieId                   title
## 1       1        Toy Story (1995)
## 2       2          Jumanji (1995)
## 3       3 Grumpier Old Men (1995)
##                                        genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2                  Adventure|Children|Fantasy
## 3                              Comedy|Romance

Printing first three rows of ratings object

head(ratings,3)
##   userId movieId rating  timestamp
## 1      1      16    4.0 1217897793
## 2      1      24    1.5 1217895807
## 3      1      32    4.0 1217896246

Data Processing

Reorganize the movie category information into one 0/1 column per category so that movies are easy to search and filter by category.

Get list of categories

genres <- as.data.frame(movies$genres, stringsAsFactors=FALSE)
# Split each genre string on "|" and transpose the resulting list into columns
categoryDF <- as.data.frame(tstrsplit(genres[,1], '[|]', type.convert=TRUE),stringsAsFactors=FALSE)

colnames(categoryDF) <- seq_len(ncol(categoryDF))
# 19 different categories in total
category_list <- c("Action", "Adventure", "Animation", "Children","Comedy", "Crime","Documentary", "Drama", "Fantasy","Film-Noir", "Horror", "Musical", "Mystery","Romance","Sci-Fi", "Thriller", "War", "Western","IMAX") 

# One row per movie plus a first row holding the category names (used for matching in the loop)
category_matrix <- matrix(0, nrow(movies) + 1, length(category_list)) 
category_matrix[1,] <- category_list 
colnames(category_matrix) <- category_list 

#Loop through all elements
for (i in 1:nrow(categoryDF)) 
{
  for (c in 1:ncol(categoryDF)) 
  {
    which_col = which(category_matrix[1,] == categoryDF[i,c])
    category_matrix[i+1,which_col] <- 1
  }
}

#convert into dataframe

#remove category which is first row
category_matrix2 <- as.data.frame(category_matrix[-1,], stringsAsFactors=FALSE) 

for (c in 1:ncol(category_matrix2)) 
{
  category_matrix2[,c] <- as.integer(category_matrix2[,c])  #convert from characters to integers
} 

#head(category_matrix2)

search_matrix <- cbind(movies[,1:2], category_matrix2)
head(search_matrix,1)
##   movieId            title Action Adventure Animation Children Comedy
## 1       1 Toy Story (1995)      0         1         1        1      1
##   Crime Documentary Drama Fantasy Film-Noir Horror Musical Mystery Romance
## 1     0           0     0       1         0      0       0       0       0
##   Sci-Fi Thriller War Western IMAX
## 1      0        0   0       0    0
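The same 0/1 encoding can be built without the header row and nested loop. The sketch below is a base-R alternative on a three-movie toy table, not the author's method; toy_movies, toy_categories and toy_search are illustrative names, and the category list is shortened to the genres that occur in the toy rows.

```r
# Toy stand-in for movies.csv with the same three columns.
toy_movies <- data.frame(
  movieId = 1:3,
  title   = c("Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"),
  genres  = c("Adventure|Animation|Children|Comedy|Fantasy",
              "Adventure|Children|Fantasy",
              "Comedy|Romance"),
  stringsAsFactors = FALSE
)
toy_categories <- c("Adventure", "Animation", "Children", "Comedy", "Fantasy", "Romance")

# Split each genre string, then flag which categories appear in it.
genre_sets <- strsplit(toy_movies$genres, "|", fixed = TRUE)
one_hot <- t(vapply(genre_sets,
                    function(g) as.integer(toy_categories %in% g),
                    integer(length(toy_categories))))
colnames(one_hot) <- toy_categories

toy_search <- cbind(toy_movies[, 1:2], one_hot)

# Example query: titles tagged both Comedy and Romance.
toy_search$title[toy_search$Comedy == 1 & toy_search$Romance == 1]
```

Because the result is an ordinary data frame of 0/1 columns, category filters reduce to logical conditions on those columns.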

Coerce into a realRatingMatrix

Using realRatingMatrix

#Create ratings matrix. Rows = userId, Columns = movieId
ratingmat <- dcast(ratings, userId~movieId, value.var = "rating", na.rm=FALSE)
ratingmat <- as.matrix(ratingmat[,-1]) 

# Coerce into a realRatingMatrix
ratingmat <- as(ratingmat, "realRatingMatrix")
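To see the shape of the matrix produced by the reshape step, here is a minimal base-R sketch on five toy ratings. It uses tapply in place of dcast purely for illustration; toy_ratings and toy_wide are made-up names, and mean is a harmless aggregator here because each user and movie pair occurs once.

```r
toy_ratings <- data.frame(
  userId  = c(1, 1, 2, 3, 3),
  movieId = c(16, 24, 16, 24, 32),
  rating  = c(4.0, 1.5, 3.0, 5.0, 4.0)
)

# One row per user, one column per movie; unrated pairs become NA.
toy_wide <- with(toy_ratings, tapply(rating, list(userId, movieId), mean))
toy_wide
```

Most cells of such a matrix are NA on real data, which is why a sparse rating matrix class is used afterwards.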

Rating distribution

vector_ratings <- as.vector(ratingmat@data)
vector_ratings <- vector_ratings[vector_ratings != 0] # rating == 0 are NA values
vector_ratings <- factor(vector_ratings)

qplot(vector_ratings) + ggtitle("Distribution of the ratings")

The distribution suggests that most users give ratings at the higher end of the scale.
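A reading of the plot can be backed by a number. The sketch below uses a made-up toy vector and a hypothetical helper share_at_least; on the real data it would be applied to the numeric ratings before they are converted to a factor.

```r
# Hypothetical helper: fraction of ratings at or above a threshold.
share_at_least <- function(r, threshold) mean(r >= threshold)

toy <- c(4.0, 1.5, 4.0, 3.0, 5.0, 3.5, 4.5, 2.0)
table(toy)                # counts per rating value
share_at_least(toy, 3.5)  # 5 of the 8 toy ratings are 3.5 or higher
```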

Trimming the data

To select the most relevant data, I keep only users who have rated more than 50 movies and movies that have been rated by more than 50 users:

trimmed_movies <- ratingmat[rowCounts(ratingmat) > 50,colCounts(ratingmat) > 50]
trimmed_movies
## 420 x 447 rating matrix of class 'realRatingMatrix' with 38341 ratings.

The trimmed matrix contains 420 users and 447 movies, compared with 668 users and 10325 movies in the full dataset.
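What the row and column count filters do can be shown on a plain matrix. This is a base-R sketch on toy data with a threshold of 1 instead of 50; rowSums and colSums over non-missing cells stand in for rowCounts and colCounts.

```r
# Toy ratings matrix: rows are users, columns are movies, NA means "not rated".
m <- matrix(c(5, NA, 3, 4,
              NA, NA, 2, NA,
              4, 5, NA, 3),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("u", 1:3), paste0("m", 1:4)))

ratings_per_user  <- rowSums(!is.na(m))  # plays the role of rowCounts()
ratings_per_movie <- colSums(!is.na(m))  # plays the role of colCounts()

# Keep users with more than 1 rating and movies with more than 1 rating.
trimmed <- m[ratings_per_user > 1, ratings_per_movie > 1, drop = FALSE]
trimmed
```

Both counts are computed on the full matrix before subsetting, so a kept movie can end up with fewer ratings than the threshold once sparse users are removed.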

Using the same approach as previously, I visualize the top 2 percent of users and movies in the new matrix of the most relevant data:
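A minimal sketch of such a plot follows. It uses base graphics on random toy data rather than the author's own plotting code, which the text does not show; the seed, matrix size and variable names are arbitrary assumptions.

```r
set.seed(1)  # arbitrary seed so the toy data is reproducible
# Toy matrix: 200 users x 150 movies, roughly 30% of cells rated.
toy <- matrix(sample(c(NA, 1:5), 200 * 150, replace = TRUE,
                     prob = c(0.7, rep(0.06, 5))),
              nrow = 200)

n_user  <- rowSums(!is.na(toy))
n_movie <- colSums(!is.na(toy))

# Cut-offs at the 98th percentile, i.e. the top 2 percent by rating count.
min_user  <- quantile(n_user, 0.98)
min_movie <- quantile(n_movie, 0.98)
top <- toy[n_user >= min_user, n_movie >= min_movie, drop = FALSE]

# Base-graphics heatmap; unrated cells are left blank.
image(x = seq_len(ncol(top)), y = seq_len(nrow(top)), z = t(top),
      xlab = "Movies (column index)", ylab = "Users (row index)",
      main = "Top 2 percent of users and movies (toy data)")
```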

IBCF

Item-based Collaborative Filtering Model

Item collaborative filtering is a form of collaborative filtering for recommender systems based on the similarity between items calculated using people’s ratings of those items.
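As an illustration of similarity between items computed from ratings, here is a minimal base-R sketch of cosine similarity on a toy matrix. Treating unrated cells as 0 is a simplifying assumption made for this example and may differ from how recommenderlab handles missing values.

```r
# Toy ratings: rows are users, columns are items, 0 marks "not rated".
r <- matrix(c(5, 4, 0,
              4, 5, 1,
              1, 0, 5,
              0, 1, 4),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("u", 1:4), c("A", "B", "C")))

# Cosine similarity between item columns: dot products scaled by column norms.
norms    <- sqrt(colSums(r^2))
item_sim <- crossprod(r) / outer(norms, norms)
round(item_sim, 2)
```

Items A and B are rated alike by the same users, so their similarity is close to 1, while C is far from both.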

Training and Test Data

The more you sweat in training, the less you bleed in combat

The model is built with roughly 80% of the users as the training set and the rest as the test set.

# Randomly flag each user as training (TRUE) or test (FALSE)
which_train <- sample(x = c(TRUE, FALSE), size = nrow(trimmed_movies),replace = TRUE, prob = c(0.8, 0.2))

head(which_train)
## [1]  TRUE FALSE  TRUE  TRUE  TRUE FALSE
train_data <- trimmed_movies[which_train, ]
test_data <- trimmed_movies[!which_train, ]

head(test_data)
## 1 x 447 rating matrix of class 'realRatingMatrix' with 50 ratings.
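Drawing TRUE or FALSE independently for each user gives a split that is only approximately 80/20 and changes on every run. A minimal alternative sketch, assuming a fixed seed is acceptable, draws exactly 80% of the row indices; the seed value and variable names are arbitrary.

```r
set.seed(42)    # arbitrary seed, only to make the split repeatable
n_users <- 420  # number of rows in the trimmed matrix reported above

# Draw exactly 80% of the row indices for training; the rest go to testing.
train_idx <- sample(seq_len(n_users), size = floor(0.8 * n_users))
test_idx  <- setdiff(seq_len(n_users), train_idx)

length(train_idx)  # 336
length(test_idx)   # 84
```

The index vectors can then subset the rating matrix by row in the same way as the logical vector above.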

Building the recommendation model

# Recommender uses the registry mechanism from the registry package to manage methods. This lets the user easily specify and add new methods. The registry is called recommenderRegistry.
recommender_models <- recommenderRegistry$get_entries(dataType ="realRatingMatrix")

recommender_models$IBCF_realRatingMatrix$parameters
## $k
## [1] 30
## 
## $method
## [1] "Cosine"
## 
## $normalize
## [1] "center"
## 
## $normalize_sim_matrix
## [1] FALSE
## 
## $alpha
## [1] 0.5
## 
## $na_as_zero
## [1] FALSE
#Create a Recommender Model
IBCF_recommender_model <- Recommender(data = train_data,method = "IBCF",parameter = list(k = 30))

Doing recommendation

Do the prediction

recc_predicted <- predict(object = IBCF_recommender_model,newdata = test_data, n = 10)
recc_predicted
## Recommendations as 'topNList' with n = 10 for 82 users.

Printing recommendation

first_user_recommendation <- recc_predicted@items[[1]] 
first_user_recommendation
##  [1] 126 136 140 165 183 187 188 190 202 212
movies_IDs <- recc_predicted@itemLabels[first_user_recommendation]
movies_IDs
##  [1] "750"  "832"  "904"  "1148" "1219" "1225" "1230" "1234" "1265" "1302"
movie_names <- movies_IDs

for (i in 1:10)
{
  movie_names[i] <- as.character(subset(movies,movies$movieId == movies_IDs[i])$title)
}
movie_names
##  [1] "Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964)"
##  [2] "Ransom (1996)"                                                              
##  [3] "Rear Window (1954)"                                                         
##  [4] "Wallace & Gromit: The Wrong Trousers (1993)"                                
##  [5] "Psycho (1960)"                                                              
##  [6] "Amadeus (1984)"                                                             
##  [7] "Annie Hall (1977)"                                                          
##  [8] "Sting, The (1973)"                                                          
##  [9] "Groundhog Day (1993)"                                                       
## [10] "Field of Dreams (1989)"
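The title lookup can also be written without a loop. A minimal sketch with match on toy stand-ins for the objects above (toy_movies, toy_ids and toy_titles are illustrative names):

```r
# Toy stand-ins for the movies table and the recommended item labels.
toy_movies <- data.frame(
  movieId = c(750, 832, 904),
  title   = c("Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964)",
              "Ransom (1996)", "Rear Window (1954)"),
  stringsAsFactors = FALSE
)
toy_ids <- c("904", "750")  # item labels arrive as character strings

# match() returns, for each ID, the row of toy_movies holding it.
toy_titles <- toy_movies$title[match(as.integer(toy_ids), toy_movies$movieId)]
toy_titles
```

This keeps the order of the recommendations and works for any number of IDs, not only ten.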

Item-based recommendation works on the basis of the item similarity matrix. It is an eager-learning model: the system constructs a general, input-independent target function (here, the similarity matrix) during training, before any prediction is requested.
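How a precomputed similarity matrix turns into a prediction can be sketched in a few lines. This is a simplified illustration with made-up similarity values and a hypothetical helper predict_item, not recommenderlab's exact procedure: an unrated item is scored by the similarity-weighted average of the user's ratings on the k most similar items the user has rated.

```r
# Toy item-item similarity matrix (hypothetical values, symmetric, items A to D).
sim <- matrix(c(1.0, 0.9, 0.2, 0.1,
                0.9, 1.0, 0.3, 0.2,
                0.2, 0.3, 1.0, 0.8,
                0.1, 0.2, 0.8, 1.0),
              nrow = 4, byrow = TRUE,
              dimnames = list(LETTERS[1:4], LETTERS[1:4]))

# One user's ratings; NA marks items the user has not rated.
user <- c(A = 5, B = NA, C = 2, D = NA)

# Hypothetical helper: weighted average over the k most similar rated items.
predict_item <- function(item, user, sim, k = 2) {
  rated   <- names(user)[!is.na(user)]
  nearest <- head(rated[order(sim[item, rated], decreasing = TRUE)], k)
  w <- sim[item, nearest]
  sum(w * user[nearest]) / sum(w)
}

predict_item("B", user, sim)  # (0.9 * 5 + 0.3 * 2) / (0.9 + 0.3) = 4.25
predict_item("D", user, sim)  # (0.8 * 2 + 0.1 * 5) / (0.8 + 0.1), about 2.33
```

All the expensive work sits in building sim once; each prediction is then a cheap lookup and weighted average, which is the practical benefit of the eager approach.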

Conclusions

I developed an item-based collaborative filtering recommender (CFR) system for recommending movies.