Introduction

Your task is to implement a matrix factorization method, such as singular value decomposition (SVD) or Alternating Least Squares (ALS), in the context of a recommender system. You may approach this assignment in a number of ways. You are welcome to start with an existing recommender system written by yourself or someone else. Remember, as always, to cite your sources so that you can be graded on what you added, not what you found. SVD can be thought of as a pre-processing step for feature engineering. You might easily start with thousands or millions of items and use SVD to create a much smaller set of “k” items (e.g. 20 or 70).

Notes/Limitations:

  • SVD builds features that may or may not map neatly to items (such as movie genres or news topics). As in many areas of machine learning, the lack of explainability can be an issue.

  • SVD requires that there are no missing values. There are various ways to handle this, including (1) imputation of missing values, (2) mean-centering values around 0, or (3) using a more advanced technique, such as stochastic gradient descent, to simulate SVD in populating the factored matrices.

  • Calculating the SVD matrices can be computationally expensive, although calculating ratings once the factorization is completed is very fast. You may need to create a subset of your data for SVD calculations to be successfully performed, especially on a machine with a small RAM footprint.
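
As an illustration of the first two options for missing values, here is a minimal sketch in base R. The toy matrix `ratings` and all variable names are hypothetical and are not part of the Amazon data used later.

```r
# Toy ratings matrix (hypothetical data): rows are users, columns are items,
# NA marks a missing rating.
ratings <- matrix(c(5, NA, 3,
                    4,  2, NA,
                    NA, 1, 4),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("u1", "u2", "u3"), c("i1", "i2", "i3")))

# Option 1: impute each missing rating with that item's (column) mean.
item_means <- colMeans(ratings, na.rm = TRUE)
imputed <- ratings
na_idx <- which(is.na(imputed), arr.ind = TRUE)
imputed[na_idx] <- item_means[na_idx[, "col"]]

# Option 2: center each user's (row's) observed ratings around 0, then
# set missing entries to 0, which equals the user's own mean after centering.
user_means <- rowMeans(ratings, na.rm = TRUE)
centered <- sweep(ratings, 1, user_means, "-")
centered[is.na(centered)] <- 0

# Both matrices are now complete and can be passed to svd().
s <- svd(centered)
```

Which option works better depends on the data; mean-centering is often preferred when users differ a lot in how generously they rate.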

Get Data

The dataset was retrieved from the data.world website. The data includes user ratings for different Amazon electronic devices on a scale of 1 to 5, with 5 being the highest rating. We are only interested in the device name, username, and rating, so we extract only those columns as a start.

Amazon Product Review Dataset

https://data.world/datafiniti/consumer-reviews-of-amazon-products

amz<- read.csv("https://raw.githubusercontent.com/apag101/Data612/master/Projects/Project3/amazon.csv", header = TRUE)
amz<- subset(amz, select=c('name','reviews.rating', 'reviews.username'))

Review and Transform

A review of the data shows that there are some NA values. We use complete.cases to remove any rows with NA values.

## Observations: 1,597
## Variables: 3
## $ name             <fct> Kindle Paperwhite, Kindle Paperwhite, Kindle Paper...
## $ reviews.rating   <int> 5, 5, 4, 5, 5, NA, NA, NA, NA, NA, NA, NA, NA, 4, ...
## $ reviews.username <fct> Cristina M, Ricky, Tedd Gardiner, Dougal, Miljan D...
## Observations: 1,177
## Variables: 3
## $ name             <fct> Kindle Paperwhite, Kindle Paperwhite, Kindle Paper...
## $ reviews.rating   <int> 5, 5, 4, 5, 5, 4, 5, 5, 4, 5, 4, 4, 5, 5, 5, 4, 5,...
## $ reviews.username <fct> Cristina M, Ricky, Tedd Gardiner, Dougal, Miljan D...
## 
##   1   2   3   4   5 
##  42  34 124 236 741

With the removal of rows containing NA values, the row count goes from 1,597 to 1,177.

In the next section, the data is transformed into a matrix with usernames as the rows, device names as the columns, and ratings as the matrix data.

amz.dat <- matrix(data=amz$reviews.rating,nrow=length(unique(amz$reviews.username)),ncol=length(unique(amz$name)))
rownames(amz.dat)<-c(paste(unique(amz$reviews.username)))
colnames(amz.dat)<-c(paste(unique(amz$name)))
glimpse(amz.dat)
##  int [1:836, 1:62] 5 5 4 5 5 4 5 5 4 5 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:836] "Cristina M" "Ricky" "Tedd Gardiner" "Dougal" ...
##   ..$ : chr [1:62] "Kindle Paperwhite" "Kindle Keyboard" "Certified Refurbished Amazon Fire TV (Previous Generation - 1st)" "Amazon Echo Dot Case (fits Echo Dot 2nd Generation only) - Indigo Fabric" ...

The resulting matrix has 836 rows and 62 columns.
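
One caveat worth noting: matrix() fills its cells column by column from the supplied vector, recycling values as needed, so it does not by itself align each rating with its user and device. Below is a minimal sketch of one way to build an aligned user-by-device matrix with base R's tapply(). The data frame `reviews` is hypothetical and only borrows the column names of amz.

```r
# Hypothetical long-format reviews with the same column names as amz.
reviews <- data.frame(
  name = c("Kindle", "Kindle", "Echo", "Echo", "Fire TV"),
  reviews.rating = c(5, 4, 3, 5, 2),
  reviews.username = c("ann", "bob", "ann", "cy", "bob"),
  stringsAsFactors = FALSE
)

# tapply() groups ratings by (user, device); averaging handles users who
# reviewed the same device more than once. Unrated pairs come back as NA.
ui <- with(reviews, tapply(reviews.rating,
                           list(reviews.username, name),
                           mean))
dim(ui)            # 3 users by 3 devices
ui["ann", "Echo"]  # 3
```

The NA cells produced this way would still need to be imputed or centered before an SVD can be computed.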

SVD Function

Singular Value Decomposition (SVD) is a method of dimensionality reduction. It begins with an input matrix A of size m x n (e.g. m rows and n columns, or m documents and n terms) that is expressed as a product of 3 matrices U, D, V.

  • U, m x r (m documents, r concepts), is the left singular matrix.
  • D, r x r (strength of each concept, where r is the rank of the matrix A), is the matrix of singular values. It is zero everywhere except on the diagonal, which holds the singular values.
  • V, n x r (n terms, r concepts), is the right singular matrix.

The final formula equates to: \(A = U D V^T\)
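
To make the factorization concrete, the following sketch uses base R's svd() on a small made-up matrix and checks that multiplying the three factors recovers the input. The matrix `A` here is an arbitrary example, not the ratings data.

```r
# Small example matrix: 4 "documents" (rows) by 3 "terms" (columns).
A <- matrix(c(1, 0, 2,
              0, 3, 1,
              4, 1, 0,
              2, 2, 2), nrow = 4, byrow = TRUE)

s <- svd(A)            # s$u is 4x3, s$d has length 3, s$v is 3x3
D <- diag(s$d)         # place the singular values on the diagonal

# Multiplying the three factors back together recovers A.
A_rebuilt <- s$u %*% D %*% t(s$v)
max(abs(A - A_rebuilt))   # numerically zero (around machine precision)
```

Note that svd() returns the singular values as a plain vector, so diag() is needed to turn them into the matrix D before multiplying.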

In this assignment we use the irlba library, which lets us state the number of singular values and vectors to estimate. In this example we set nv to 30. The output below shows the first 10 D, U, and V values, as well as the number of iterations and the total number of matrix-vector products carried out.

amz.svd<-irlba(t(amz.dat), nv=30, maxit=200)

D values:

##  [1] 993.00566  47.58560  39.83519  37.36527  37.03986  36.77521  36.60924
##  [8]  36.20103  35.50962  35.16535

U values:

##  [1]  0.12990172  0.12490559  0.12613131  0.13076546  0.12528140  0.12350393
##  [7] -0.08328865 -0.08634993  0.16311025 -0.02753623

V values, iterations, and matrix-vector products:

##  [1] 0.035011953 0.033605630 0.034884376 0.033745851 0.033232679 0.035488864
##  [7] 0.005053155 0.023277869 0.012220521 0.040270522
## [1] 12
## [1] 148

This final code carries out the SVD matrix calculation. The first 10 reconstructed values are broadly comparable to the first 10 original values, though they do not match exactly.

amz.svdi<-amz.svd$u %*% amz.svd$d %*% t(amz.svd$v[1,])
amz.svdi[1:10]
##  [1] 4.020200 3.946102 3.422356 5.159498 3.165943 4.292214 4.882875 2.983146
##  [9] 3.701984 4.304987
amz.dat[1:10]
##  [1] 5 5 4 5 5 4 5 5 4 5
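
For comparison, a rank-k reconstruction is usually written with the singular values placed on a diagonal, i.e. the first k columns of U times a k x k diagonal matrix times the transpose of the first k columns of V. Here is a minimal sketch with base R's svd() on hypothetical random ratings; the names `A`, `k`, and `A_k` are illustrative only.

```r
set.seed(1)
# Hypothetical dense ratings: 6 users by 5 items, values 1 to 5.
A <- matrix(sample(1:5, 30, replace = TRUE), nrow = 6)

s <- svd(A)
k <- 2   # number of singular values to keep (an arbitrary choice here)

# Keep the first k columns of U and V and the first k singular values.
# nrow = k keeps diag() well-behaved even when k is 1.
A_k <- s$u[, 1:k, drop = FALSE] %*%
  diag(s$d[1:k], nrow = k) %*%
  t(s$v[, 1:k, drop = FALSE])

# The approximation error shrinks as k grows and vanishes at full rank.
err_k    <- norm(A - A_k, type = "F")
A_full   <- s$u %*% diag(s$d) %*% t(s$v)
err_full <- norm(A - A_full, type = "F")
```

The truncated product has the same dimensions as the input, so its entries can be compared cell by cell with the original ratings.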

Manual Calculation

This next group of code performs the SVD calculations manually. We take the matrix product of the original matrix and its transpose, then compute the eigenvectors to derive the V values. Because irlba was run on the transposed matrix, its V values correspond to the rows of amz.dat. The result matches the V values from irlba up to sign.

atransv<- amz.dat %*% t(amz.dat)
atransv.e <- eigen(atransv)
head(atransv.e$vectors[1:6])
## [1] -0.03501195 -0.03360563 -0.03488438 -0.03374585 -0.03323268 -0.03548886
head(amz.svd$v[1:6])
## [1] 0.03501195 0.03360563 0.03488438 0.03374585 0.03323268 0.03548886

Here we do the reverse: we take the matrix product of the transpose of the original matrix and the original matrix, then compute the eigenvectors to derive the U values. The result matches the U values from irlba up to sign.

atransu<- t(amz.dat) %*% amz.dat
atransu.e <- eigen(atransu)
head(atransu.e$vectors[1:6])
## [1] -0.1299017 -0.1249056 -0.1261313 -0.1307655 -0.1252814 -0.1235039
head(amz.svd$u[1:6])
## [1] 0.1299017 0.1249056 0.1261313 0.1307655 0.1252814 0.1235039
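
The sign difference in both comparisons is expected, since eigenvectors and singular vectors are only determined up to a factor of -1. The sketch below checks the relationship on a small made-up matrix `A`, using base R's svd() and eigen(); it assumes the singular values are distinct, which holds for this example.

```r
A <- matrix(c(2, 0, 1,
              1, 3, 0,
              0, 1, 4,
              1, 1, 1), nrow = 4, byrow = TRUE)

s <- svd(A)

# Right singular vectors: eigenvectors of t(A) %*% A.
ev <- eigen(crossprod(A), symmetric = TRUE)
# Left singular vectors: eigenvectors of A %*% t(A) (first 3 of 4 columns).
eu <- eigen(tcrossprod(A), symmetric = TRUE)

# Eigenvectors are only defined up to sign, so compare absolute values.
diff_v <- max(abs(abs(ev$vectors) - abs(s$v)))
diff_u <- max(abs(abs(eu$vectors[, 1:3]) - abs(s$u)))
```

Both differences should be close to machine precision.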

For the diagonal, we take the square root of the eigenvalues computed for V and multiply by the first three columns of an identity matrix. This sets all values to 0 except the 3 diagonal entries. The diagonal entries agree with the first three D values from irlba to the precision shown.

r <- sqrt(atransv.e$values)
r <- r * diag(length(r))[,1:3]
r[1:3,]
##          [,1]    [,2]     [,3]
## [1,] 993.0057  0.0000  0.00000
## [2,]   0.0000 47.5856  0.00000
## [3,]   0.0000  0.0000 39.83519
amz.svd$d[1:3]
## [1] 993.00566  47.58560  39.83519
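
One practical detail when taking square roots of eigenvalues: eigenvalues that are zero in exact arithmetic can come out as tiny negative numbers in floating point, which makes sqrt() return NaN. A minimal sketch on a made-up matrix `A`, where clamping with pmax() is one possible safeguard:

```r
A <- matrix(c(2, 0, 1,
              1, 3, 0,
              0, 1, 4,
              1, 1, 1), nrow = 4, byrow = TRUE)

# 4 eigenvalues of A %*% t(A); A has rank 3, so the last one is zero in
# exact arithmetic but may be slightly negative in floating point.
ev <- eigen(tcrossprod(A), symmetric = TRUE)$values

# Clamp at zero before taking the root to avoid NaN.
sing <- sqrt(pmax(ev, 0))

D <- diag(sing[1:3])                 # 3x3 diagonal matrix of singular values
same <- all.equal(sing[1:3], svd(A)$d)
```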

Here we multiply the 3 matrices to reconstruct the final values. The results are close to the original values and to the values calculated from the irlba SVD output.

atransm<-atransu.e$vectors[1:3]*-1 %*% r[1:3] %*% t(atransv.e$vectors)[1:3]*-1
atransm[1:3]
## [1] 4.516302 4.342601 4.385216
amz.dat[1:3]
## [1] 5 5 4
amz.svdi[1:3]
## [1] 4.020200 3.946102 3.422356
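
Because the expression above indexes single elements, and `%*%` binds more tightly than `*` in R, it appears to use only the leading singular value and the first entries of the leading vectors. That amounts to a rank-1 approximation. Below is a minimal sketch of a full rank-1 reconstruction on a made-up matrix `A`, which also shows why the two `* -1` factors cancel.

```r
A <- matrix(c(5, 4, 5,
              4, 5, 4,
              5, 5, 3,
              4, 4, 5), nrow = 4, byrow = TRUE)

s  <- svd(A)
u1 <- s$u[, 1]   # leading left singular vector
d1 <- s$d[1]     # largest singular value
v1 <- s$v[, 1]   # leading right singular vector

# Rank-1 approximation: outer product of the leading singular vectors,
# scaled by the largest singular value.
A1 <- d1 * outer(u1, v1)

# Flipping the sign of both vectors leaves the product unchanged, which is
# why the two "* -1" factors in a manual calculation cancel out.
A1_flipped <- d1 * outer(-u1, -v1)
same <- all.equal(A1, A1_flipped)
```

For a ratings matrix whose entries are all positive and similar in size, the first singular value tends to dominate, so even a rank-1 approximation can land near the original ratings.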

Conclusion

SVD allows us to find similarities between users and concepts by reducing dimensions. With the new matrices, you can select a user space and find how similar a user is to others based on preference, by using the cross product of the user's ratings and the ratings for similar devices.

uq<- matrix(c(0),nrow=nrow(amz.dat))
uq[1,1]<-5
sqrt(uq[1:3]%*%atransm)
##          [,1]
## [1,] 4.752001
atransm[1]
## [1] 4.516302
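
One common way to compare users in the reduced space is cosine similarity between their rows of U scaled by the singular values. The sketch below uses hypothetical random ratings and illustrative names (`user_space`, `sim`); it is not the calculation used above.

```r
set.seed(42)
# Hypothetical ratings: 5 users (rows) by 6 devices (columns).
A <- matrix(sample(1:5, 30, replace = TRUE), nrow = 5,
            dimnames = list(paste0("user", 1:5), NULL))

s <- svd(A)
k <- 2
# Each row of user_space places one user in the k-dimensional concept space.
user_space <- s$u[, 1:k, drop = FALSE] %*% diag(s$d[1:k], nrow = k)
rownames(user_space) <- rownames(A)

# Cosine similarity between every pair of users: normalize each row to
# unit length, then take all pairwise dot products.
unit <- user_space / sqrt(rowSums(user_space^2))
sim  <- unit %*% t(unit)

# Most similar other user to user1.
others <- sim["user1", -1]
nearest <- names(which.max(others))
```

Devices rated highly by the nearest user, but not yet rated by user1, would then be natural candidates to recommend.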

APPENDIX

Code used in analysis

knitr::opts_chunk$set(
    echo = FALSE,
    message = FALSE,
    warning = FALSE
)
#knitr::opts_chunk$set(echo = TRUE)
require(knitr)
library(ggplot2)
library(tidyr)
library(MASS)
library(psych)
library(kableExtra)
library(dplyr)
library(faraway)
library(gridExtra)
library(reshape2)
library(leaps)
library(pROC)
library(caret)
library(naniar)
library(pander)
library(mlbench)
library(e1071)
library(fpp2)
library(mlr)
library(recommenderlab)
library(irlba)
amz<- read.csv("https://raw.githubusercontent.com/apag101/Data612/master/Projects/Project3/amazon.csv", header = TRUE)
amz<- subset(amz, select=c('name','reviews.rating', 'reviews.username'))
namz<-nrow(amz)
glimpse(amz)
amz<-subset(amz, complete.cases(amz))
namz2<-nrow(amz)
glimpse(amz)
table(amz$reviews.rating)
amz.dat <- matrix(data=amz$reviews.rating,nrow=length(unique(amz$reviews.username)),ncol=length(unique(amz$name)))
rownames(amz.dat)<-c(paste(unique(amz$reviews.username)))
colnames(amz.dat)<-c(paste(unique(amz$name)))
glimpse(amz.dat)
amz.svd<-irlba(t(amz.dat), nv=30, maxit=200)
amz.svd$d[1:10] 
head(amz.svd$u)[1:10]
head(amz.svd$v)[1:10]
amz.svd$iter
amz.svd$mprod
amz.svdi<-amz.svd$u %*% amz.svd$d %*% t(amz.svd$v[1,])
amz.svdi[1:10]
amz.dat[1:10]
atransv<- amz.dat %*% t(amz.dat)
atransv.e <- eigen(atransv)
head(atransv.e$vectors[1:6])
head(amz.svd$v[1:6])
atransu<- t(amz.dat) %*% amz.dat
atransu.e <- eigen(atransu)
head(atransu.e$vectors[1:6])
head(amz.svd$u[1:6])
r <- sqrt(atransv.e$values)
r <- r * diag(length(r))[,1:3]
r[1:3,]
amz.svd$d[1:3]
atransm<-atransu.e$vectors[1:3]*-1 %*% r[1:3] %*% t(atransv.e$vectors)[1:3]*-1
atransm[1:3]
amz.dat[1:3]
amz.svdi[1:3]
uq<- matrix(c(0),nrow=nrow(amz.dat))
uq[1,1]<-5
sqrt(uq[1:3]%*%atransm)
atransm[1]