To build our recommendation system, we use the MovieLens dataset. It consists of two files, movies.csv and ratings.csv, which together contain 105339 ratings applied to 10329 movies.
library(recommenderlab)
Loading required package: Matrix
Loading required package: arules
Warning:
Attaching package: ‘arules’
The following objects are masked from ‘package:base’:
abbreviate, write
Loading required package: proxy
Warning: package ‘proxy’ was built under R version 4.1.3
Attaching package: ‘proxy’
The following object is masked from ‘package:Matrix’:
as.matrix
The following objects are masked from ‘package:stats’:
as.dist, dist
The following object is masked from ‘package:base’:
as.matrix
Loading required package: registry
Registered S3 methods overwritten by 'registry':
method from
print.registry_field proxy
print.registry_entry proxy
library(ggplot2)
Warning:
library(data.table)
Warning: package ‘data.table’ was built under R version 4.1.3
Registered S3 method overwritten by 'data.table':
method from
print.data.table
data.table 1.14.2 using 4 threads (see ?getDTthreads). Latest news: r-datatable.com
library(reshape2)
Warning: package ‘reshape2’ was built under R version 4.1.3
Attaching package: ‘reshape2’
The following objects are masked from ‘package:data.table’:
dcast, melt
We will now read movies.csv into the movie_data data frame and ratings.csv into rating_data. We then use the str() function to display the structure of movie_data.
movie_data <- read.csv("movies.csv",stringsAsFactors=FALSE)
rating_data <- read.csv("ratings.csv")
str(movie_data)
'data.frame': 10329 obs. of 3 variables:
$ movieId: int 1 2 3 4 5 6 7 8 9 10 ...
$ title  : chr "Toy Story (1995)" "Jumanji (1995)" "Grumpier Old Men (1995)" "Waiting to Exhale (1995)" ...
$ genres : chr "Adventure|Animation|Children|Comedy|Fantasy" "Adventure|Children|Fantasy" "Comedy|Romance" "Comedy|Drama|Romance" ...
summary(movie_data)
    movieId          title              genres
 Min.   :     1   Length:10329       Length:10329
                  Class :character   Class :character
 Median :  7088
 Mean   : 31924
 3rd Qu.: 59900
 Max.   :149532
head(movie_data)
summary(rating_data)
     userId         movieId           rating        timestamp
 Min.   :  1.0   Min.   :     1   Min.   :0.500   Min.   :8.286e+08
 1st Qu.:192.0   1st Qu.:  1073   1st Qu.:3.000   1st Qu.:9.711e+08
 Median :383.0                    Median :3.500
 Mean   :364.9   Mean   : 13381   Mean   :3.517
 3rd Qu.:557.0   3rd Qu.:  5991   3rd Qu.:4.000   3rd Qu.:1.275e+09
 Max.   :668.0   Max.   :149532   Max.   :5.000   Max.   :1.452e+09
head(rating_data)
From the above summary, we observe that the userId and movieId columns consist of integers. Furthermore, we need to convert the genres in the movie_data data frame into a format that is easier to work with. To do so, we split each pipe-separated genre string and build a one-hot encoded matrix with one row per film and one column per genre, where 1 marks a genre that the film belongs to.
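movie_genre <- as.data.frame(movie_data$genres, stringsAsFactors = FALSE)
library(data.table)
# Split each "A|B|C" genre string into separate columns
movie_genre2 <- as.data.frame(tstrsplit(movie_genre[,1], '[|]', type.convert = TRUE), stringsAsFactors = FALSE)
colnames(movie_genre2) <- c(1:10)
list_genre <- c("Action", "Adventure", "Animation", "Children", "Comedy", "Crime","Documentary", "Drama", "Fantasy", "Film-Noir", "Horror", "Musical", "Mystery","Romance", "Sci-Fi", "Thriller", "War", "Western")
# One row per movie plus a first row that holds the genre names
genre_mat1 <- matrix(0, nrow(movie_data) + 1, length(list_genre))
genre_mat1[1,] <- list_genre
colnames(genre_mat1) <- list_genre
for (index in 1:nrow(movie_genre2)) {
  for (col in 1:ncol(movie_genre2)) {
    gen_col = which(genre_mat1[1,] == movie_genre2[index, col])
    genre_mat1[index+1, gen_col] <- 1
  }
}
# Remove the first row (the genre names), then convert characters to integers
genre_mat2 <- as.data.frame(genre_mat1[-1,], stringsAsFactors = FALSE)
for (col in 1:ncol(genre_mat2)) {
  genre_mat2[,col] <- as.integer(genre_mat2[,col])
}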
str(genre_mat2)
'data.frame': 10329 obs. of 18 variables:
 $ Action     : int 0 0 0 0 0 1 0 0 1 1 ...
 $ Adventure  : int 1 1 0 0 0 0 0 1 0 1 ...
 $ Animation  : int 1 0 0 0 0 0 0 0 0 0 ...
 $ Children   : int 1 1 0 0 0 0 0 1 0 0 ...
 $ Comedy     : int 1 0 1 1 1 0 1 0 0 0 ...
 $ Crime      : int 0 0 0 0 0 1 0 0 0 0 ...
 $ Documentary: int 0 0 0 0 0 0 0 0 0 0 ...
 $ Drama      : int 0 0 0 1 0 0 0 0 0 0 ...
 $ Fantasy    : int 1 1 0 0 0 0 0 0 0 0 ...
 $ Film-Noir  : int 0 0 0 0 0 0 0 0 0 0 ...
 $ Horror     : int 0 0 0 0 0 0 0 0 0 0 ...
 $ Musical    : int ...
 $ Mystery    : int ...
 $ Romance    : int 0 0 1 1 0 0 1 0 0 0 ...
 $ Sci-Fi     : int 0 0 0 0 0 0 0 0 0 0 ...
 $ Thriller   : int 0 0 0 0 0 1 0 0 0 1 ...
 $ War        : int 0 0 0 0 0 0 0 0 0 0 ...
 $ Western    : int 0 0 0 0 0 0 0 0 0 0 ...
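The nested loop above can also be written in vectorized form. The following is a minimal sketch on a toy genre vector, not the code used in this article; the names genres, genre_levels and one_hot are hypothetical and the three genre strings are illustrative stand-ins for movie_data$genres.

```r
# Toy stand-in for movie_data$genres (illustrative values only)
genres <- c("Adventure|Children|Fantasy", "Comedy|Romance", "Comedy|Drama|Romance")
genre_levels <- c("Adventure", "Children", "Comedy", "Drama", "Fantasy", "Romance")

# Split each string on the literal "|" character
genre_lists <- strsplit(genres, "|", fixed = TRUE)

# For every movie, mark which of the known genres appear in its list;
# sapply() returns one column per movie, so transpose to one row per movie
one_hot <- t(sapply(genre_lists, function(g) as.integer(genre_levels %in% g)))
colnames(one_hot) <- genre_levels
one_hot
```

Each row of one_hot corresponds to one film and each column to one genre, which is the same layout as genre_mat2.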
SearchMatrix <- cbind(movie_data[,1:2], genre_mat2[])
head(SearchMatrix)
ratingMatrix <- dcast(rating_data, userId~movieId, value.var = "rating", na.rm=FALSE)
ratingMatrix <- as.matrix(ratingMatrix[,-1])
ratingMatrix <- as(ratingMatrix, "realRatingMatrix")
ratingMatrix
668 x 10325 rating matrix of class ‘realRatingMatrix’ with 105339 ratings.
recommendation_model <- recommenderRegistry$get_entries(dataType = "realRatingMatrix")
names(recommendation_model)
 [1] "HYBRID_realRatingMatrix"
 [2] "ALS_realRatingMatrix"
 [3] "ALS_implicit_realRatingMatrix"
 [4] "IBCF_realRatingMatrix"
 [5] "LIBMF_realRatingMatrix"
 [6] "POPULAR_realRatingMatrix"
 [7] "RANDOM_realRatingMatrix"
 [8] "RERECOMMEND_realRatingMatrix"
 [9] "SVD_realRatingMatrix"
[10] "SVDF_realRatingMatrix"
[11] "UBCF_realRatingMatrix"
lapply(recommendation_model, "[[", "description")
$HYBRID_realRatingMatrix
[1] "Hybrid recommender that aggegates several recommendation strategies using weighted averages."
$ALS_realRatingMatrix
[1] "Recommender for explicit ratings based on latent factors, calculated by alternating least squares algorithm."
$ALS_implicit_realRatingMatrix
[1] "Recommender for implicit data based on latent factors, calculated by alternating least squares algorithm."
$IBCF_realRatingMatrix
[1] "Recommender based on item-based collaborative filtering."
$LIBMF_realRatingMatrix
[1] "Matrix factorization with LIBMF via package recosystem (https://cran.r-project.org/web/packages/recosystem/vignettes/introduction.html)."
$POPULAR_realRatingMatrix
[1]
$RANDOM_realRatingMatrix
[1]
$RERECOMMEND_realRatingMatrix
[1] "Re-recommends highly rated items (real ratings)."
$SVD_realRatingMatrix
[1] "Recommender based on SVD approximation with column-mean imputation."
$SVDF_realRatingMatrix
[1] "Recommender based on Funk SVD with gradient descend (https://sifter.org/~simon/journal/20061211.html)."
$UBCF_realRatingMatrix
[1] "Recommender based on user-based collaborative filtering."
recommendation_model$IBCF_realRatingMatrix$parameters
$k
[1] 30
$method
[1] "Cosine"
$normalize
[1] "center"
$normalize_sim_matrix
[1] FALSE
$alpha
[1] 0.5
$na_as_zero
[1] FALSE
Collaborative filtering suggests movies to a user based on preferences collected from many other users. For example, if user A and user B both like action films, then movies that B watches and rates highly can be recommended to A, and vice versa. Recommendations therefore depend on measuring the similarity between users. With recommenderlab, we can compute similarities using methods such as cosine, pearson and jaccard.
similarity_mat <- similarity(ratingMatrix[1:4, ], method = "cosine", which = "users")
as.matrix(similarity_mat)
          1         2         3         4
1 0.0000000 0.9760860 0.9641723 0.9914398
2 0.9760860 0.0000000 0.9925732 0.9374253
3 0.9641723 0.9925732 0.0000000 0.9888968
4 0.9914398 0.9374253 0.9888968 0.0000000
image(as.matrix(similarity_mat), main = "User's Similarities")
In the above matrix, each row and each column represents a user. We have taken the first four users, and each cell holds the similarity between the corresponding pair of users.
Now, we compute the similarity between the first four films:
movie_similarity <- similarity(ratingMatrix[, 1:4], method = "cosine", which = "items")
as.matrix(movie_similarity)
          1         2         3         4
1 0.0000000 0.9669732 0.9559341 0.9101276
2 0.9669732 0.0000000 0.9658757 0.9412416
3 0.9559341 0.9658757 0.0000000 0.9864877
4 0.9101276 0.9412416 0.9864877 0.0000000
image(as.matrix(movie_similarity), main = "Movies similarity")
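To make the similarity values above more concrete, here is a minimal base-R sketch of cosine similarity between two rating vectors. It assumes that only items rated by both users enter the calculation; recommenderlab's own handling of missing ratings may differ in detail. The function cosine_sim and the two toy vectors are hypothetical.

```r
# Cosine similarity over the items that both users have rated
cosine_sim <- function(a, b) {
  both <- !is.na(a) & !is.na(b)          # co-rated items only
  if (!any(both)) return(NA_real_)       # no overlap, similarity undefined
  sum(a[both] * b[both]) /
    (sqrt(sum(a[both]^2)) * sqrt(sum(b[both]^2)))
}

# Toy ratings for four films; NA means "not rated"
user_a <- c(5, 3, NA, 4)
user_b <- c(4, NA, 2, 5)
cosine_sim(user_a, user_b)
```

Here only the first and fourth films are rated by both users, so the result is (5*4 + 4*5) / (sqrt(41) * sqrt(41)) = 40/41. The same function applied to two columns of a rating matrix gives an item-to-item similarity.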
rating_values <- as.vector(ratingMatrix@data)
unique(rating_values)
[1] 0.0 5.0 3.0 4.5 2.0 3.5 2.5 0.5
Table_of_Ratings <- table(rating_values)
Table_of_Ratings
rating_values
0 0.5 1 1.5 2 2.5
6791761 1198 1567 7943
3 3.5 4 4.5 5
21729 12237 28880 8187 14856
library(ggplot2)
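movie_views <- colCounts(ratingMatrix)   # number of ratings per movie
table_views <- data.frame(movie = names(movie_views), views = movie_views)
# Sort so that the most-viewed films come first
table_views <- table_views[order(table_views$views, decreasing = TRUE), ]
table_views$title <- NA
for (index in 1:nrow(table_views)) {
  table_views[index, 3] <- as.character(subset(movie_data, movie_data$movieId == table_views[index, 1])$title)
}
table_views[1:6,]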
ggplot(table_views[1:6, ], aes(x = title, y = views)) +
geom_bar(stat="identity", fill = 'steelblue') +
geom_text(aes(label=views), vjust=-0.3, size=3.5) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle("Total Views of the Top Films")
image(ratingMatrix[1:20, 1:25], axes = FALSE, main = "Heatmap of the first 20 rows and 25 columns")
To keep only useful data, we set a threshold of 50: we retain users who have rated more than 50 films and films that have been rated by more than 50 users. This separates the frequently watched films from the least-watched ones.
movie_ratings <- ratingMatrix[rowCounts(ratingMatrix) > 50, colCounts(ratingMatrix) > 50]
movie_ratings
420 x 447 rating matrix of class ‘realRatingMatrix’ with 38341 ratings.
From the above output of ‘movie_ratings’, we observe that there are 420 users and 447 films, as opposed to the previous 668 users and 10325 films. We can now visualize the matrix of the most relevant users and films as follows:
minimum_movies<- quantile(rowCounts(movie_ratings), 0.98)
minimum_users <- quantile(colCounts(movie_ratings), 0.98)
image(movie_ratings[rowCounts(movie_ratings) > minimum_movies,
colCounts(movie_ratings) > minimum_users],
main = "Heatmap of the top users and movies")
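The count-based filtering used for movie_ratings can be illustrated without recommenderlab. The sketch below is a base-R analogue on a hypothetical 4 x 4 toy matrix, with a threshold of 2 instead of 50; rowSums(!is.na(...)) and colSums(!is.na(...)) play the roles of rowCounts() and colCounts().

```r
# Toy rating matrix: rows are users, columns are movies, NA means "not rated"
ratings <- matrix(c( 5, NA,  3,  4,
                    NA, NA,  2, NA,
                     4,  1,  5,  3,
                     2, NA,  4,  5),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(paste0("u", 1:4), paste0("m", 1:4)))

min_count <- 2   # hypothetical threshold; the article uses 50

ratings_per_user  <- rowSums(!is.na(ratings))   # how many films each user rated
ratings_per_movie <- colSums(!is.na(ratings))   # how many users rated each film

# Keep users and movies whose counts exceed the threshold
filtered <- ratings[ratings_per_user > min_count,
                    ratings_per_movie > min_count,
                    drop = FALSE]
filtered
```

As in the article's code, both counts are computed on the unfiltered matrix, so a kept user may end up with fewer ratings than the threshold once sparse movies are dropped.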
Now, we will visualize the distribution of the average ratings per user.
average_ratings <- rowMeans(movie_ratings)
qplot(average_ratings, fill=I("steelblue"), col=I("red")) +
ggtitle("Distribution of the average rating per user")
`stat_bin()` using `bins = 30`. Pick better
value with `binwidth`.
Normalization is a data preparation step that puts numerical values on a common scale without distorting the differences between them. Here, normalize() centers the ratings so that each user's average rating becomes 0, which removes the bias of users who consistently rate high or low. We then plot a heatmap of the normalized ratings.
normalized_ratings <- normalize(movie_ratings)
sum(rowMeans(normalized_ratings) > 0.0001)
[1]
image(normalized_ratings[rowCounts(normalized_ratings) > minimum_movies, colCounts(normalized_ratings) > minimum_users], main = "Normalized Ratings of the Top Users")
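As a rough illustration of what centering does, the following base-R sketch subtracts each user's mean rating from that user's ratings. It is an analogue of the default "center" normalization, not recommenderlab's implementation, and the toy matrix is hypothetical.

```r
# Toy rating matrix: two users, three films, NA means "not rated"
ratings <- matrix(c(5,  3, NA,
                    2, NA,  4),
                  nrow = 2, byrow = TRUE)

# Mean rating of each user, ignoring unrated films
user_means <- rowMeans(ratings, na.rm = TRUE)

# Subtract each row's mean from that row (sweep() subtracts by default)
centered <- sweep(ratings, 1, user_means)
centered

# After centering, every user's average rating is 0
rowMeans(centered, na.rm = TRUE)
```

The first user's ratings 5 and 3 become 1 and -1, so a rating now expresses how much a film was liked relative to that user's own average.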
Binarizing the data means reducing it to two discrete values, 1 and 0, which lets our recommendation system work more efficiently. We define a matrix that contains 1 if the rating is 3 or higher and 0 otherwise.
binary_minimum_movies <- quantile(rowCounts(movie_ratings), 0.95)
binary_minimum_users <- quantile(colCounts(movie_ratings), 0.95)
#movies_watched <- binarize(movie_ratings, minRating = 1)
good_rated_films <- binarize(movie_ratings, minRating = 3)
image(good_rated_films[rowCounts(movie_ratings) > binary_minimum_movies,
colCounts(movie_ratings) > binary_minimum_users],
main = "Heatmap of the top users and movies")
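The thresholding itself is simple to express in base R. The sketch below assumes that a rating equal to the minimum counts as 1 and that unrated films become 0; the toy matrix and the name min_rating are hypothetical.

```r
# Toy rating matrix: two users, three films, NA means "not rated"
ratings <- matrix(c(5.0, 3, NA,
                    2.5, 1,  4),
                  nrow = 2, byrow = TRUE)

min_rating <- 3   # same threshold as in the article's binarize() call

# 1 where a rating exists and reaches the threshold, 0 everywhere else
binary <- ifelse(!is.na(ratings) & ratings >= min_rating, 1L, 0L)
binary
```

Note that the result no longer distinguishes a low rating from a missing one, which is the price paid for the simpler representation.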
Item-based collaborative filtering finds similarities between items based on people's ratings of them. The algorithm first builds a table of similar items from the items that the same customers have purchased or rated together. This table is then fed into the recommendation system. Before building the model, we randomly assign about 80% of the users to a training set and the rest to a test set.
sampled_data<- sample(x = c(TRUE, FALSE),
size = nrow(movie_ratings),
replace = TRUE,
prob = c(0.8, 0.2))
training_data <- movie_ratings[sampled_data, ]
testing_data <- movie_ratings[!sampled_data, ]
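Because sample() draws TRUE or FALSE independently for each user, the split sizes vary from run to run and are only approximately 80/20. The sketch below shows two optional variations under stated assumptions: fixing the random seed for reproducibility, and drawing an exact 80% of the row indices. The seed value and the names n_users, in_train and train_idx are hypothetical.

```r
set.seed(42)     # hypothetical seed; any fixed value makes the split repeatable
n_users <- 420   # number of users in movie_ratings, taken from the output above

# Variant 1: independent draw per user, as in the article (sizes vary)
in_train <- sample(c(TRUE, FALSE), size = n_users,
                   replace = TRUE, prob = c(0.8, 0.2))
c(train = sum(in_train), test = sum(!in_train))

# Variant 2: exactly 80% of the users in the training set
train_idx <- sample(seq_len(n_users), size = floor(0.8 * n_users))
test_idx  <- setdiff(seq_len(n_users), train_idx)
c(train = length(train_idx), test = length(test_idx))
```

With the second variant, the training and test sets would be selected as movie_ratings[train_idx, ] and movie_ratings[test_idx, ].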
We will now explore the parameters of our Item Based Collaborative Filter, shown here with their default values. The parameter k is the number of most similar items to keep for each item; here, k is equal to 30. The algorithm therefore identifies the 30 most similar items for each item and stores their similarity values. We use the cosine method, which is the default, but you can also use the pearson method.
recommendation_system <- recommenderRegistry$get_entries(dataType ="realRatingMatrix")
recommendation_system$IBCF_realRatingMatrix$parameters
$k
[1] 30
$method
[1] "Cosine"
$normalize
[1] "center"
$normalize_sim_matrix
[1] FALSE
$alpha
[1] 0.5
$na_as_zero
[1] FALSE
recommen_model <- Recommender(data = training_data,
method = "IBCF",
parameter = list(k = 30))
recommen_model
Recommender of type ‘IBCF’ for ‘realRatingMatrix’
learned using 355 users.
class(recommen_model)
[1]
attr(,"package")
[1] "recommenderlab"
model_info <- getModel(recommen_model)
class(model_info$sim)
[1] "dgCMatrix"
attr(,"package")
[1] "Matrix"
dim(model_info$sim)
[1] 447 447
top_items <- 20
image(model_info$sim[1:top_items, 1:top_items])
sum_rows <- rowSums(model_info$sim > 0)
table(sum_rows)
sum_rows
30
sum_cols <- colSums(model_info$sim > 0)
qplot(sum_cols, fill=I("steelblue"), col=I("red"))+ ggtitle("Distribution of the column count")
`stat_bin()` using `bins = 30`. Pick better
value with `binwidth`.
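The row counts above reflect the role of k: each row of the similarity matrix keeps only its k largest entries. The following base-R sketch shows one way such pruning could work on a small dense matrix. It is an illustration under that assumption, not recommenderlab's internal code, and keep_top_k and the toy matrix are hypothetical.

```r
# Keep only the k largest similarities in each row; set the rest to 0
keep_top_k <- function(sim, k) {
  diag(sim) <- 0                                    # an item is not its own neighbour
  for (i in seq_len(nrow(sim))) {
    ranked  <- order(sim[i, ], decreasing = TRUE)   # column indices, most similar first
    to_zero <- ranked[seq_along(ranked) > k]        # everything beyond the k largest
    sim[i, to_zero] <- 0
  }
  sim
}

# Toy symmetric item-item similarity matrix for four items
sim <- matrix(c(1.0, 0.9, 0.2, 0.5,
                0.9, 1.0, 0.4, 0.1,
                0.2, 0.4, 1.0, 0.8,
                0.5, 0.1, 0.8, 1.0),
              nrow = 4, byrow = TRUE)

pruned <- keep_top_k(sim, k = 2)
rowSums(pruned > 0)   # every row keeps exactly k non-zero entries
```

Pruning is done row by row, so the result is generally not symmetric and the number of non-zero entries per column can vary, which is what the column-count histogram above displays.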
We create a top_recommendations variable, set to 10, that specifies the number of films to recommend to each user. We then use the predict() function, which identifies items similar to those the user has rated and ranks them. Each of the user's ratings is used as a weight: it is multiplied by the similarity between the rated item and the candidate item, and the products are summed to give the candidate's score.
top_recommendations <- 10 # the number of items to recommend to each user
predicted_recommendations <- predict(object = recommen_model,
newdata = testing_data,
n = top_recommendations)
predicted_recommendations
Recommendations as ‘topNList’ with n = 10 for 65 users.
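The weighted-sum scoring described above can be sketched in a few lines of base R. This is a simplified illustration on a hypothetical four-item similarity matrix; the actual predict() method may normalize or weight the scores differently.

```r
items <- c("A", "B", "C", "D")

# Toy item-item similarity matrix (hypothetical values)
sim <- matrix(c(0.0, 0.9, 0.2, 0.1,
                0.9, 0.0, 0.4, 0.3,
                0.2, 0.4, 0.0, 0.8,
                0.1, 0.3, 0.8, 0.0),
              nrow = 4, byrow = TRUE, dimnames = list(items, items))

# One user's ratings; NA means "not rated yet"
user_ratings <- c(A = 5, B = NA, C = 2, D = NA)
rated <- !is.na(user_ratings)

# Score of each item = sum over rated items of (similarity * rating)
scores <- as.vector(sim[, rated, drop = FALSE] %*% user_ratings[rated])
names(scores) <- items

scores[rated] <- NA   # never recommend what the user has already rated

# Rank the remaining candidates, best first (sort() drops the NA entries)
ranked <- names(sort(scores, decreasing = TRUE))
ranked
```

Item B scores 0.9 * 5 + 0.4 * 2 = 5.3 and item D scores 0.1 * 5 + 0.8 * 2 = 2.1, so B is ranked ahead of D for this user.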
user1 <- predicted_recommendations@items[[1]] # recommendation for the first user
movies_user1 <- predicted_recommendations@itemLabels[user1]
movies_user2 <- movies_user1
for (index in 1:10){
movies_user2[index] <- as.character(subset(movie_data,
movie_data$movieId == movies_user1[index])$title)
}
movies_user2
 [1] "Sabrina (1995)"
 [2] "Get Shorty (1995)"
 [3] "Clueless (1995)"
 [4] "Congo (1995)"
 [5] "Net, The (1995)"
 [6] "Little Women (1994)"
 [7] "Quiz Show (1994)"
 [8] "Santa Clause, The (1994)"
 [9] "Four Weddings and a Funeral (1994)"
[10]
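The loop that maps movie IDs to titles can also be replaced by a single vectorized lookup with match(). The sketch below uses a small hypothetical lookup table in place of movie_data, and the IDs in it are illustrative; recommended_ids stands in for the character labels stored in movies_user1.

```r
# Toy stand-in for movie_data (illustrative rows only)
movie_lookup <- data.frame(
  movieId = c(7L, 21L, 39L),
  title   = c("Sabrina (1995)", "Get Shorty (1995)", "Clueless (1995)"),
  stringsAsFactors = FALSE
)

# Item labels come back as character strings, so convert before matching
recommended_ids <- c("39", "7")

# match() returns, for each ID, the row of the lookup table that holds it
titles <- movie_lookup$title[match(as.integer(recommended_ids),
                                   movie_lookup$movieId)]
titles
```

An ID that is missing from the lookup table yields NA instead of an error, which makes gaps easy to spot.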
recommendation_matrix <- sapply(predicted_recommendations@items,
function(x){ as.integer(colnames(movie_ratings)[x]) }) # matrix with the recommendations for each user
#dim(recc_matrix)
recommendation_matrix[,1:4]
0 1 2 3
[1,] 7 6 3 1704
[2,] 17 39 2355
[3,] 39 62 50 1674
[4,] 160 223 104 1343
[5,] 185 235 110 48516
[6,] 261 158
[7,] 300 527 165 1968
[8,] 317 541 293 110
[9,] 357 551 318 7147
[10,] 364 593 350 2011