Collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users.
Source: Wikipedia, "Collaborative filtering"
The MovieLens Latest Datasets are available at https://grouplens.org/datasets/movielens/latest/
File name: ml-latest-small.zip
Direct download URL: http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
Download the zip file above, unzip it, and read the movies.csv and ratings.csv files.
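# Load the packages used below
library(data.table)     # tstrsplit()
library(reshape2)       # dcast(); loaded after data.table so that reshape2's version is used (assumed)
library(ggplot2)        # qplot()
library(recommenderlab) # realRatingMatrix, Recommender()
# Read the data files
movies <- read.csv("movies.csv", stringsAsFactors=FALSE)
ratings <- read.csv("ratings.csv")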
Print the first three rows of the movies object:
head(movies,3)
## movieId title
## 1 1 Toy Story (1995)
## 2 2 Jumanji (1995)
## 3 3 Grumpier Old Men (1995)
## genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2 Adventure|Children|Fantasy
## 3 Comedy|Romance
Print the first three rows of the ratings object:
head(ratings,3)
## userId movieId rating timestamp
## 1 1 16 4.0 1217897793
## 2 1 24 1.5 1217895807
## 3 1 32 4.0 1217896246
Reorganize the genre information into a binary movie-by-genre matrix so that movies are easy to search by category.
genres <- as.data.frame(movies$genres, stringsAsFactors=FALSE)
#strsplit and transpose the resulting list efficiently
categoryDF <- as.data.frame(tstrsplit(genres[,1], '[|]', type.convert=TRUE),stringsAsFactors=FALSE)
colnames(categoryDF) <- seq_len(ncol(categoryDF))
# 19 different categories in total
category_list <- c("Action", "Adventure", "Animation", "Children","Comedy", "Crime","Documentary", "Drama", "Fantasy","Film-Noir", "Horror", "Musical", "Mystery","Romance","Sci-Fi", "Thriller", "War", "Western","IMAX")
# One row per movie plus a header row holding the category names
category_matrix <- matrix(0, nrow(movies) + 1, length(category_list))
category_matrix[1,] <- category_list
colnames(category_matrix) <- category_list
# Loop through all movies and their genre columns
for (i in 1:nrow(categoryDF)) {
  for (j in 1:ncol(categoryDF)) {
    # Column whose header matches this genre (empty if the cell is NA)
    which_col <- which(category_matrix[1,] == categoryDF[i,j])
    category_matrix[i+1, which_col] <- 1
  }
}
# Convert into a data frame, dropping the first row (the category names)
category_matrix2 <- as.data.frame(category_matrix[-1,], stringsAsFactors=FALSE)
# Convert each column from character to integer
for (j in 1:ncol(category_matrix2)) {
  category_matrix2[,j] <- as.integer(category_matrix2[,j])
}
#head(category_matrix2)
search_matrix <- cbind(movies[,1:2], category_matrix2)
head(search_matrix,1)
## movieId title Action Adventure Animation Children Comedy
## 1 1 Toy Story (1995) 0 1 1 1 1
## Crime Documentary Drama Fantasy Film-Noir Horror Musical Mystery Romance
## 1 0 0 0 1 0 0 0 0 0
## Sci-Fi Thriller War Western IMAX
## 1 0 0 0 0 0
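The same kind of indicator matrix can also be built without an explicit double loop. The following is a minimal base R sketch, not the approach used above; `toy_genres` and `toy_categories` are made-up stand-ins for `movies$genres` and `category_list`.

```r
# Toy stand-ins (hypothetical data, not taken from movies.csv)
toy_genres <- c("Adventure|Animation|Comedy", "Adventure|Fantasy", "Comedy|Romance")
toy_categories <- c("Adventure", "Animation", "Comedy", "Fantasy", "Romance")

# Split each genre string on the literal "|" character
genre_split <- strsplit(toy_genres, "|", fixed = TRUE)

# For every movie, flag which categories appear in its genre list
toy_matrix <- t(vapply(
  genre_split,
  function(g) as.integer(toy_categories %in% g),
  integer(length(toy_categories))
))
colnames(toy_matrix) <- toy_categories
toy_matrix
```

Each row is a movie and each column a category, with 1 marking membership, which is the same layout as `search_matrix` minus the ID and title columns.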
Using realRatingMatrix
#Create ratings matrix. Rows = userId, Columns = movieId
ratingmat <- dcast(ratings, userId~movieId, value.var = "rating", na.rm=FALSE)
ratingmat <- as.matrix(ratingmat[,-1])
# Coerce into a realRatingMatrix
ratingmat <- as(ratingmat, "realRatingMatrix")
vector_ratings <- as.vector(ratingmat@data)
vector_ratings <- vector_ratings[vector_ratings != 0] # rating == 0 are NA values
vector_ratings <- factor(vector_ratings)
qplot(vector_ratings) + ggtitle("Distribution of the ratings")
It looks like most users give relatively high ratings.
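That impression from the plot can be checked numerically. A small sketch, assuming a plain numeric vector of ratings; `toy_ratings` is hypothetical data on the MovieLens 0.5 to 5 scale, not values from ratings.csv.

```r
# Hypothetical ratings (not from ratings.csv)
toy_ratings <- c(4, 3.5, 5, 4, 2, 4.5, 3, 4, 1.5, 5)

# Share of each distinct rating value
rating_share <- prop.table(table(toy_ratings))

# Fraction of ratings at or above 4 stars
high_share <- mean(toy_ratings >= 4)
high_share
```

On the real data the same two lines would be applied to the non-zero ratings before they are converted to a factor.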
In order to select the most relevant data, I keep only the users who have rated more than 50 movies and the movies that have been rated more than 50 times:
trimmed_movies <- ratingmat[rowCounts(ratingmat) > 50,colCounts(ratingmat) > 50]
trimmed_movies
## 420 x 447 rating matrix of class 'realRatingMatrix' with 38341 ratings.
This selection of the most relevant data contains 420 users and 447 movies, compared with 668 users and 10325 movies in the full dataset.
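To make the trimming step concrete, here is a base R sketch of the same idea on an ordinary matrix. It assumes `NA` marks a missing rating; `toy`, the user and movie labels, and the threshold of 1 are all made up for illustration (the text above uses 50).

```r
# Hypothetical 4 users x 4 movies rating matrix; NA means "not rated"
toy <- matrix(
  c( 5,  4, NA,  3,
    NA, NA,  2, NA,
     4,  5,  3, NA,
    NA,  3, NA, NA),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("u", 1:4), paste0("m", 1:4))
)

ratings_per_user  <- rowSums(!is.na(toy))  # plays the role of rowCounts()
ratings_per_movie <- colSums(!is.na(toy))  # plays the role of colCounts()

min_count <- 1  # assumed threshold for the toy data
toy_trimmed <- toy[ratings_per_user > min_count,
                   ratings_per_movie > min_count, drop = FALSE]
toy_trimmed
```

Both counts are computed on the full matrix before subsetting, just as in the one-line filter above.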
Using the same approach as previously, I visualize the top 2 percent of users and movies in the new matrix of the most relevant data:
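Picking the top 2 percent can be sketched with quantile(). This is a minimal base R illustration on made-up numbers, assuming the cutoff is the 98th percentile of the number of ratings per user; `counts` is a hypothetical stand-in for rowCounts(trimmed_movies).

```r
# Hypothetical number of ratings for each of 100 users
counts <- 1:100

# 98th percentile of the counts (names = FALSE drops the "98%" label)
cutoff <- quantile(counts, 0.98, names = FALSE)

# Indices of the users above the cutoff, i.e. the top 2 percent
top_users <- which(counts > cutoff)
top_users
```

The same cutoff logic applied to the column counts would select the most-rated movies, and the two index sets together give the sub-matrix to plot.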
Item collaborative filtering is a form of collaborative filtering for recommender systems based on the similarity between items calculated using people’s ratings of those items.
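As a rough illustration of what "similarity between items" means, the sketch below computes cosine similarity between the columns of a tiny rating matrix in base R. The data are made up, and treating an unrated movie as 0 is a simplification that may differ from how recommenderlab handles missing ratings.

```r
# Hypothetical ratings: 3 users (rows) x 3 movies (columns); 0 stands for "not rated"
toy <- matrix(
  c(5, 4, 0,
    4, 5, 1,
    0, 1, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("u", 1:3), c("A", "B", "C"))
)

# Cosine similarity between every pair of columns (items):
# dot product of two columns divided by the product of their lengths
col_norms <- sqrt(colSums(toy^2))
item_sim <- crossprod(toy) / outer(col_norms, col_norms)
round(item_sim, 2)
```

Movies A and B are rated alike by the same users, so their similarity is much higher than that of A and C.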
The more you sweat in training, the less you bleed in combat
Build the model using roughly 80% of the users as the training set and the rest as the test set.
# Randomly assign each user (row) to the training set with probability 0.8
which_train <- sample(x = c(TRUE, FALSE), size = nrow(trimmed_movies), replace = TRUE, prob = c(0.8, 0.2))
head(which_train)
## [1] TRUE FALSE TRUE TRUE TRUE FALSE
train_data <- trimmed_movies[which_train, ]
test_data <- trimmed_movies[!which_train, ]
head(test_data)
## 1 x 447 rating matrix of class 'realRatingMatrix' with 50 ratings.
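The random split above gives roughly, not exactly, 80% of the users to the training set, and it changes on every run. If a reproducible, exact split is preferred, one possible sketch is shown below; the seed value and the names `n_users`, `train_idx` and `is_train` are assumptions for illustration.

```r
set.seed(123)   # assumed seed; any fixed value makes the split repeatable
n_users <- 420  # number of users in the trimmed matrix above

# Draw exactly 80% of the row indices without replacement
train_idx <- sample(n_users, size = floor(0.8 * n_users))

# Logical mask that could be used to index the rating matrix by row
is_train <- seq_len(n_users) %in% train_idx
table(is_train)
```

The mask would then be used like the logical vector in the original split, with `!is_train` selecting the test users.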
# Recommender uses the registry mechanism from the registry package to manage methods. This lets the user easily specify and add new methods. The registry is called recommenderRegistry.
recommender_models <- recommenderRegistry$get_entries(dataType ="realRatingMatrix")
recommender_models$IBCF_realRatingMatrix$parameters
## $k
## [1] 30
##
## $method
## [1] "Cosine"
##
## $normalize
## [1] "center"
##
## $normalize_sim_matrix
## [1] FALSE
##
## $alpha
## [1] 0.5
##
## $na_as_zero
## [1] FALSE
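The `k` parameter is the number of most similar items kept for each item. A minimal base R sketch of that kind of pruning follows, using a made-up 4 x 4 similarity matrix and k = 2; it illustrates the idea and is not recommenderlab's internal code.

```r
# Hypothetical item-item similarity matrix for 4 items
sim <- matrix(
  c(1.0, 0.9, 0.2, 0.4,
    0.9, 1.0, 0.3, 0.1,
    0.2, 0.3, 1.0, 0.8,
    0.4, 0.1, 0.8, 1.0),
  nrow = 4, byrow = TRUE,
  dimnames = list(LETTERS[1:4], LETTERS[1:4])
)

k <- 2
diag(sim) <- 0  # an item is not counted as its own neighbour

# For each row, keep the k largest similarities and zero out the rest
pruned <- t(apply(sim, 1, function(s) {
  keep <- order(s, decreasing = TRUE)[seq_len(k)]
  s[-keep] <- 0
  s
}))
pruned
```

A smaller k gives a sparser model that is cheaper to store, at the cost of ignoring weaker neighbours.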
#Create a Recommender Model
IBCF_recommender_model <- Recommender(data = train_data,method = "IBCF",parameter = list(k = 30))
recc_predicted <- predict(object = IBCF_recommender_model,newdata = test_data, n = 10)
recc_predicted
## Recommendations as 'topNList' with n = 10 for 82 users.
first_user_recommendation <- recc_predicted@items[[1]]
first_user_recommendation
## [1] 126 136 140 165 183 187 188 190 202 212
movies_IDs <- recc_predicted@itemLabels[first_user_recommendation]
movies_IDs
## [1] "750" "832" "904" "1148" "1219" "1225" "1230" "1234" "1265" "1302"
# Look up the title of each recommended movie ID
movie_names <- movies$title[match(as.integer(movies_IDs), movies$movieId)]
movie_names
## [1] "Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964)"
## [2] "Ransom (1996)"
## [3] "Rear Window (1954)"
## [4] "Wallace & Gromit: The Wrong Trousers (1993)"
## [5] "Psycho (1960)"
## [6] "Amadeus (1984)"
## [7] "Annie Hall (1977)"
## [8] "Sting, The (1973)"
## [9] "Groundhog Day (1993)"
## [10] "Field of Dreams (1989)"
Item-based recommendation works from the item similarity matrix. It is an eager-learning model: the system builds a general, input-independent target function during training, before any recommendation is requested.
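How a similarity matrix turns into a recommendation can be sketched as follows. This assumes a simple similarity-weighted average as the scoring rule; the matrix, the user's ratings and all names are made up, and the exact computation inside recommenderlab may differ.

```r
# Hypothetical item-item similarities for 3 movies
sim <- matrix(
  c(0.0, 0.8, 0.1,
    0.8, 0.0, 0.3,
    0.1, 0.3, 0.0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("A", "B", "C"), c("A", "B", "C"))
)

# One user's ratings; NA marks a movie not yet rated
user <- c(A = 5, B = NA, C = 2)
rated <- !is.na(user)

# Score each movie as the similarity-weighted average of the user's known ratings
weights <- sim[, rated, drop = FALSE]
scores <- as.vector(weights %*% user[rated]) / rowSums(weights)
names(scores) <- rownames(sim)

scores[rated] <- NA  # only movies the user has not rated are candidates
scores
```

Movie B is scored mostly from the user's rating of A, because A is its closest neighbour; sorting the non-missing scores would give the top-n list.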
In summary, this walkthrough developed an item-based collaborative filtering recommender (CFR) system for recommending movies.