The purpose here is to implement singular value decomposition to build a content-based recommender system.
We will build a toy dataset as follows.
moviesraw <- matrix(c(4, NA, 3, 5, 2, NA, 5, NA, 5, 4, NA, 4, 5, 3, 5, 4, 2, NA, 1, 1, 4, 2, 4, NA, 3, 5, 3, 5, 3, 4, 5, NA, NA, 5, NA, NA, 1, 3, 5, 5, NA, 4), nrow = 6, byrow = T)
colnames(moviesraw) <- c("Batman Begins", "Alice in Wonderland", "Dumb and Dumber", "Equilibrium", "Matilda", "Salt", "Marley and Me")
rownames(moviesraw) <- c("Adam", "Benjamin", "Charlie", "David", "Edward", "Matthew")
moviesraw
## Batman Begins Alice in Wonderland Dumb and Dumber Equilibrium
## Adam 4 NA 3 5
## Benjamin NA 5 4 NA
## Charlie 5 4 2 NA
## David 2 4 NA 3
## Edward 3 4 5 NA
## Matthew NA 1 3 5
## Matilda Salt Marley and Me
## Adam 2 NA 5
## Benjamin 4 5 3
## Charlie 1 1 4
## David 5 3 5
## Edward NA 5 NA
## Matthew 5 NA 4
First, we will obtain the similarity matrix and perform singular value decomposition to determine underlying relationships.
library(recommenderlab)
## Loading required package: Matrix
## Loading required package: arules
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
## Loading required package: proxy
##
## Attaching package: 'proxy'
## The following object is masked from 'package:Matrix':
##
## as.matrix
## The following objects are masked from 'package:stats':
##
## as.dist, dist
## The following object is masked from 'package:base':
##
## as.matrix
## Loading required package: registry
m <- as(moviesraw, "realRatingMatrix")
simm <- as.matrix(similarity(m, method = "cosine", which = "items"))
simm
## Batman Begins Alice in Wonderland Dumb and Dumber
## Batman Begins 0.0000000 0.6327720 0.6343582
## Alice in Wonderland 0.6327720 0.0000000 0.7469371
## Dumb and Dumber 0.6343582 0.7469371 0.0000000
## Equilibrium 0.4606281 0.2572806 0.4920678
## Matilda 0.3714512 0.6760068 0.5831297
## Salt 0.4567734 0.9154578 0.7644551
## Marley and Me 0.7132675 0.6702335 0.6207359
## Equilibrium Matilda Salt Marley and Me
## Batman Begins 0.4606281 0.3714512 0.4567734 0.7132675
## Alice in Wonderland 0.2572806 0.6760068 0.9154578 0.6702335
## Dumb and Dumber 0.4920678 0.5831297 0.7644551 0.6207359
## Equilibrium 0.0000000 0.7725291 0.1512658 0.8188504
## Matilda 0.7725291 0.0000000 0.5515663 0.8833005
## Salt 0.1512658 0.5515663 0.0000000 0.4601322
## Marley and Me 0.8188504 0.8833005 0.4601322 0.0000000
s <- svd(simm)
s$u #left singular matrix
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] -0.3461233 0.37152613 -0.13957343 0.1653793 -0.20420339 -0.80788666
## [2,] -0.4032681 -0.49208028 -0.52486033 -0.1869991 -0.35601296 0.10355086
## [3,] -0.3951422 0.04425780 0.03511763 -0.4512097 0.76551701 -0.09442493
## [4,] -0.3187832 -0.33798264 -0.27010628 0.4852534 0.31997306 0.02289997
## [5,] -0.3963520 0.64347471 -0.20573845 -0.1544017 -0.19052496 0.50780337
## [6,] -0.3538040 0.07631013 0.36733016 0.6281951 0.05870674 0.25685652
## [7,] -0.4213564 -0.28939830 0.67349126 -0.2831096 -0.32157495 -0.05758774
## [,7]
## [1,] -0.03092018
## [2,] -0.38381877
## [3,] -0.20529994
## [4,] 0.61058135
## [5,] 0.26175021
## [6,] -0.51963178
## [7,] 0.31343115
V <- s$v
t(V) #right singular matrix
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] -0.34612328 -0.4032681 -0.39514224 -0.31878321 -0.3963520 -0.35380401
## [2,] -0.37152613 0.4920803 -0.04425780 0.33798264 -0.6434747 -0.07631013
## [3,] 0.13957343 0.5248603 -0.03511763 0.27010628 0.2057385 -0.36733016
## [4,] -0.16537933 0.1869991 0.45120968 -0.48525338 0.1544017 -0.62819511
## [5,] 0.20420339 0.3560130 -0.76551701 -0.31997306 0.1905250 -0.05870674
## [6,] 0.80788666 -0.1035509 0.09442493 -0.02289997 -0.5078034 -0.25685652
## [7,] -0.03092018 -0.3838188 -0.20529994 0.61058135 0.2617502 -0.51963178
## [,7]
## [1,] -0.42135643
## [2,] 0.28939830
## [3,] -0.67349126
## [4,] 0.28310955
## [5,] 0.32157495
## [6,] 0.05758774
## [7,] 0.31343115
Due to the great variation in the eigenvalues, it can be inferred that dimension reduction is possible.
s$d #eigenvalues
## [1] 3.6462514 1.0000000 0.9850439 0.9274296 0.6581750 0.3478795 0.2722766
First, we will build a toy vector that contains text information about these movies hardcoded from imdb.
reviews <- c("After training with his mentor, Batman begins his fight to free crime-ridden Gotham City from the corruption that Scarecrow and the League of Shadows have cast upon it.", "Nineteen-year-old Alice returns to the magical world from her childhood adventure, where she reunites with her old friends and learns of her true destiny: to end the Red Queen's reign of terror.", "Twenty years since their first adventure, Lloyd and Harry go on a road trip to find Harry's newly discovered daughter, who was given up for adoption.", "In a fascist future where all forms of feeling are illegal, a man in charge of enforcing the law rises to overthrow the system.", "Story of a wonderful little girl, who happens to be a genius, and her wonderful teacher vs. the worst parents ever and the worst school principal imaginable.", "A CIA agent goes on the run after a defector accuses her of being a Russian spy.", "A family learns important life lessons from their adorable, but naughty and neurotic dog.")
reviews
## [1] "After training with his mentor, Batman begins his fight to free crime-ridden Gotham City from the corruption that Scarecrow and the League of Shadows have cast upon it."
## [2] "Nineteen-year-old Alice returns to the magical world from her childhood adventure, where she reunites with her old friends and learns of her true destiny: to end the Red Queen's reign of terror."
## [3] "Twenty years since their first adventure, Lloyd and Harry go on a road trip to find Harry's newly discovered daughter, who was given up for adoption."
## [4] "In a fascist future where all forms of feeling are illegal, a man in charge of enforcing the law rises to overthrow the system."
## [5] "Story of a wonderful little girl, who happens to be a genius, and her wonderful teacher vs. the worst parents ever and the worst school principal imaginable."
## [6] "A CIA agent goes on the run after a defector accuses her of being a Russian spy."
## [7] "A family learns important life lessons from their adorable, but naughty and neurotic dog."
Next, the tm package is used for text-mining where a corpus for each movie is created.
library(tm)
## Loading required package: NLP
##
## Attaching package: 'tm'
## The following object is masked from 'package:arules':
##
## inspect
movie_corpus <- Corpus(VectorSource(reviews))
Now a document-term matrix is created to determine the importance of each movie. Then, TF-IDF weighting is used adjusting for the fact that certain words appear more often.
dtm <- DocumentTermMatrix(movie_corpus, control = list(removePunctuation = TRUE, removeNumbers = TRUE, stopwords = TRUE, tolower = TRUE, weighting = weightTfIdf))
dtm_matrix <- as.matrix(dtm)
dimnames(dtm_matrix)$Docs <- colnames(moviesraw)
Singular value decomposition is performed as follows.
movie_svd <- svd(dtm_matrix)
Sigmak <- movie_svd$d; Uk <- movie_svd$u; Vk <- t(as.matrix(movie_svd$v))
The reduced transposed matrix below is the basis for this recommender system with the columns representing the 5 movies and the rows representing the 5 singular values.
movie_cat <- as.matrix(t(Uk))
dimnames(movie_cat) <- list(SVs = paste0("sv", 1:7), Movies = colnames(moviesraw))
A similar matrix of users and categories can be created now after normalizing the toy dataset and then creating a mean-centered matrix.
library(recommenderlab)
movies <- as(moviesraw, "realRatingMatrix")
movies_norm <- normalize(movies)
norm_cat <- as.vector(movies_norm@data)
norm_cat <- ifelse(norm_cat == 0, NA, norm_cat)
norm_cat <- matrix(norm_cat, nrow = nrow(movies), ncol = ncol(movies))
user_cat <- matrix(nrow = nrow(moviesraw), ncol = ncol(moviesraw))
for (i in 1:nrow(user_cat)){
for (j in 1:ncol(user_cat)){
user_cat[i,j] <- sum(norm_cat[i,]*movie_cat[j,], na.rm = T)/sum(movie_cat[j,] != 0)
}
}
Now that both matrices have been created, Pearson’s correlation can be used to estimate the ratings for unrated movies.
estimated <- matrix(nrow = nrow(moviesraw), ncol = ncol(moviesraw))
colnames(estimated) = colnames(moviesraw)
rownames(estimated) = rownames(moviesraw)
for (i in 1:nrow(estimated)){
for(j in 1:ncol(estimated)){
if(is.na(norm_cat[i,j])){
for(k in 1:nrow(movie_cat)){
num <- sum(user_cat[i,k]*movie_cat[k,j])
den <- sqrt(sum(user_cat[i,]^2, na.rm = T))*sqrt(sum(movie_cat[,j]^2, na.rm = T))
}
estimated[i,j] <- num/den
}else{estimated[i,j] <- NA}
}
}
estimated
## Batman Begins Alice in Wonderland Dumb and Dumber Equilibrium
## Adam NA 0.0108498 NA NA
## Benjamin 0 NA NA 0
## Charlie NA NA NA 0
## David NA NA -0.00607948 NA
## Edward NA NA NA 0
## Matthew 0 NA NA NA
## Matilda Salt Marley and Me
## Adam NA 0 NA
## Benjamin NA NA NA
## Charlie NA NA NA
## David NA NA NA
## Edward 0 NA 0.004006255
## Matthew NA 0 NA
SVD is not a suitable method to implement when building a content-based recommender system. The reason for this is because the ratings for this matrix are supposed to be between 1 and 5. There are ratings that are 0 as well as ratings that are close to 0. It seems that content-based recommendations would not suffice.