Motivation

The purpose here is to implement singular value decomposition to build a content-based recommender system.

Data Creation

We will build a toy dataset as follows.

moviesraw <- matrix(c(4, NA, 3, 5, 2, NA, 5, NA, 5, 4, NA, 4, 5, 3, 5, 4, 2, NA, 1, 1, 4, 2, 4, NA, 3, 5, 3, 5, 3, 4, 5, NA, NA, 5, NA, NA, 1, 3, 5, 5, NA, 4), nrow = 6, byrow = T)
colnames(moviesraw) <- c("Batman Begins", "Alice in Wonderland", "Dumb and Dumber", "Equilibrium", "Matilda", "Salt", "Marley and Me")
rownames(moviesraw) <- c("Adam", "Benjamin", "Charlie", "David", "Edward", "Matthew")
moviesraw
##          Batman Begins Alice in Wonderland Dumb and Dumber Equilibrium
## Adam                 4                  NA               3           5
## Benjamin            NA                   5               4          NA
## Charlie              5                   4               2          NA
## David                2                   4              NA           3
## Edward               3                   4               5          NA
## Matthew             NA                   1               3           5
##          Matilda Salt Marley and Me
## Adam           2   NA             5
## Benjamin       4    5             3
## Charlie        1    1             4
## David          5    3             5
## Edward        NA    5            NA
## Matthew        5   NA             4

Checking the Spread

First, we will obtain the similarity matrix and perform singular value decomposition to determine underlying relationships.

library(recommenderlab)
## Loading required package: Matrix
## Loading required package: arules
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
## Loading required package: proxy
## 
## Attaching package: 'proxy'
## The following object is masked from 'package:Matrix':
## 
##     as.matrix
## The following objects are masked from 'package:stats':
## 
##     as.dist, dist
## The following object is masked from 'package:base':
## 
##     as.matrix
## Loading required package: registry
m <- as(moviesraw, "realRatingMatrix")
simm <- as.matrix(similarity(m, method = "cosine", which = "items"))
simm
##                     Batman Begins Alice in Wonderland Dumb and Dumber
## Batman Begins           0.0000000           0.6327720       0.6343582
## Alice in Wonderland     0.6327720           0.0000000       0.7469371
## Dumb and Dumber         0.6343582           0.7469371       0.0000000
## Equilibrium             0.4606281           0.2572806       0.4920678
## Matilda                 0.3714512           0.6760068       0.5831297
## Salt                    0.4567734           0.9154578       0.7644551
## Marley and Me           0.7132675           0.6702335       0.6207359
##                     Equilibrium   Matilda      Salt Marley and Me
## Batman Begins         0.4606281 0.3714512 0.4567734     0.7132675
## Alice in Wonderland   0.2572806 0.6760068 0.9154578     0.6702335
## Dumb and Dumber       0.4920678 0.5831297 0.7644551     0.6207359
## Equilibrium           0.0000000 0.7725291 0.1512658     0.8188504
## Matilda               0.7725291 0.0000000 0.5515663     0.8833005
## Salt                  0.1512658 0.5515663 0.0000000     0.4601322
## Marley and Me         0.8188504 0.8833005 0.4601322     0.0000000
s <- svd(simm)
s$u #left singular matrix
##            [,1]        [,2]        [,3]       [,4]        [,5]        [,6]
## [1,] -0.3461233  0.37152613 -0.13957343  0.1653793 -0.20420339 -0.80788666
## [2,] -0.4032681 -0.49208028 -0.52486033 -0.1869991 -0.35601296  0.10355086
## [3,] -0.3951422  0.04425780  0.03511763 -0.4512097  0.76551701 -0.09442493
## [4,] -0.3187832 -0.33798264 -0.27010628  0.4852534  0.31997306  0.02289997
## [5,] -0.3963520  0.64347471 -0.20573845 -0.1544017 -0.19052496  0.50780337
## [6,] -0.3538040  0.07631013  0.36733016  0.6281951  0.05870674  0.25685652
## [7,] -0.4213564 -0.28939830  0.67349126 -0.2831096 -0.32157495 -0.05758774
##             [,7]
## [1,] -0.03092018
## [2,] -0.38381877
## [3,] -0.20529994
## [4,]  0.61058135
## [5,]  0.26175021
## [6,] -0.51963178
## [7,]  0.31343115
V <- s$v
t(V) #right singular matrix
##             [,1]       [,2]        [,3]        [,4]       [,5]        [,6]
## [1,] -0.34612328 -0.4032681 -0.39514224 -0.31878321 -0.3963520 -0.35380401
## [2,] -0.37152613  0.4920803 -0.04425780  0.33798264 -0.6434747 -0.07631013
## [3,]  0.13957343  0.5248603 -0.03511763  0.27010628  0.2057385 -0.36733016
## [4,] -0.16537933  0.1869991  0.45120968 -0.48525338  0.1544017 -0.62819511
## [5,]  0.20420339  0.3560130 -0.76551701 -0.31997306  0.1905250 -0.05870674
## [6,]  0.80788666 -0.1035509  0.09442493 -0.02289997 -0.5078034 -0.25685652
## [7,] -0.03092018 -0.3838188 -0.20529994  0.61058135  0.2617502 -0.51963178
##             [,7]
## [1,] -0.42135643
## [2,]  0.28939830
## [3,] -0.67349126
## [4,]  0.28310955
## [5,]  0.32157495
## [6,]  0.05758774
## [7,]  0.31343115

Due to the great variation in the eigenvalues, it can be inferred that dimension reduction is possible.

s$d #eigenvalues
## [1] 3.6462514 1.0000000 0.9850439 0.9274296 0.6581750 0.3478795 0.2722766

Creating a Content-Based Recommender

First, we will build a toy vector that contains text information about these movies hardcoded from imdb.

reviews <- c("After training with his mentor, Batman begins his fight to free crime-ridden Gotham City from the corruption that Scarecrow and the League of Shadows have cast upon it.", "Nineteen-year-old Alice returns to the magical world from her childhood adventure, where she reunites with her old friends and learns of her true destiny: to end the Red Queen's reign of terror.", "Twenty years since their first adventure, Lloyd and Harry go on a road trip to find Harry's newly discovered daughter, who was given up for adoption.", "In a fascist future where all forms of feeling are illegal, a man in charge of enforcing the law rises to overthrow the system.", "Story of a wonderful little girl, who happens to be a genius, and her wonderful teacher vs. the worst parents ever and the worst school principal imaginable.", "A CIA agent goes on the run after a defector accuses her of being a Russian spy.", "A family learns important life lessons from their adorable, but naughty and neurotic dog.")
reviews
## [1] "After training with his mentor, Batman begins his fight to free crime-ridden Gotham City from the corruption that Scarecrow and the League of Shadows have cast upon it."                          
## [2] "Nineteen-year-old Alice returns to the magical world from her childhood adventure, where she reunites with her old friends and learns of her true destiny: to end the Red Queen's reign of terror."
## [3] "Twenty years since their first adventure, Lloyd and Harry go on a road trip to find Harry's newly discovered daughter, who was given up for adoption."                                             
## [4] "In a fascist future where all forms of feeling are illegal, a man in charge of enforcing the law rises to overthrow the system."                                                                   
## [5] "Story of a wonderful little girl, who happens to be a genius, and her wonderful teacher vs. the worst parents ever and the worst school principal imaginable."                                     
## [6] "A CIA agent goes on the run after a defector accuses her of being a Russian spy."                                                                                                                  
## [7] "A family learns important life lessons from their adorable, but naughty and neurotic dog."

Next, the tm package is used for text-mining where a corpus for each movie is created.

library(tm)
## Loading required package: NLP
## 
## Attaching package: 'tm'
## The following object is masked from 'package:arules':
## 
##     inspect
movie_corpus <- Corpus(VectorSource(reviews))

Now a document-term matrix is created to determine the importance of each movie. Then, TF-IDF weighting is used adjusting for the fact that certain words appear more often.

dtm <- DocumentTermMatrix(movie_corpus, control = list(removePunctuation = TRUE, removeNumbers = TRUE, stopwords = TRUE, tolower = TRUE, weighting = weightTfIdf))
dtm_matrix <- as.matrix(dtm)
dimnames(dtm_matrix)$Docs <- colnames(moviesraw)

Performing Singular Value Decomposition on Content

Singular value decomposition is performed as follows.

movie_svd <- svd(dtm_matrix)
Sigmak <- movie_svd$d; Uk <- movie_svd$u; Vk <- t(as.matrix(movie_svd$v))

Creating the System

The reduced transposed matrix below is the basis for this recommender system with the columns representing the 5 movies and the rows representing the 5 singular values.

movie_cat <- as.matrix(t(Uk))
dimnames(movie_cat) <- list(SVs = paste0("sv", 1:7), Movies = colnames(moviesraw))

A similar matrix of users and categories can be created now after normalizing the toy dataset and then creating a mean-centered matrix.

library(recommenderlab)
movies <- as(moviesraw, "realRatingMatrix")
movies_norm <- normalize(movies)
norm_cat <- as.vector(movies_norm@data)
norm_cat <- ifelse(norm_cat == 0, NA, norm_cat)
norm_cat <- matrix(norm_cat, nrow = nrow(movies), ncol = ncol(movies))

user_cat <- matrix(nrow = nrow(moviesraw), ncol = ncol(moviesraw))
for (i in 1:nrow(user_cat)){
  for (j in 1:ncol(user_cat)){
    user_cat[i,j] <- sum(norm_cat[i,]*movie_cat[j,], na.rm = T)/sum(movie_cat[j,] != 0) 
  }
}

Now that both matrices have been created, Pearson’s correlation can be used to estimate the ratings for unrated movies.

estimated <- matrix(nrow = nrow(moviesraw), ncol = ncol(moviesraw))
colnames(estimated) = colnames(moviesraw)
rownames(estimated) = rownames(moviesraw)
for (i in 1:nrow(estimated)){
  for(j in 1:ncol(estimated)){
    if(is.na(norm_cat[i,j])){
      for(k in 1:nrow(movie_cat)){
        num <- sum(user_cat[i,k]*movie_cat[k,j])
        den <- sqrt(sum(user_cat[i,]^2, na.rm = T))*sqrt(sum(movie_cat[,j]^2, na.rm = T))
      }
      estimated[i,j] <- num/den
    }else{estimated[i,j] <- NA}
  }
}
estimated
##          Batman Begins Alice in Wonderland Dumb and Dumber Equilibrium
## Adam                NA           0.0108498              NA          NA
## Benjamin             0                  NA              NA           0
## Charlie             NA                  NA              NA           0
## David               NA                  NA     -0.00607948          NA
## Edward              NA                  NA              NA           0
## Matthew              0                  NA              NA          NA
##          Matilda Salt Marley and Me
## Adam          NA    0            NA
## Benjamin      NA   NA            NA
## Charlie       NA   NA            NA
## David         NA   NA            NA
## Edward         0   NA   0.004006255
## Matthew       NA    0            NA

SVD is not a suitable method to implement when building a content-based recommender system. The reason for this is because the ratings for this matrix are supposed to be between 1 and 5. There are ratings that are 0 as well as ratings that are close to 0. It seems that content-based recommendations would not suffice.