Your task is to implement a matrix factorization method, such as singular value decomposition (SVD) or Alternating Least Squares (ALS), in the context of a recommender system. You may approach this assignment in a number of ways. You are welcome to start with an existing recommender system written by yourself or someone else. Remember, as always, to cite your sources, so that you can be graded on what you added, not what you found. SVD can be thought of as a pre-processing step for feature engineering. You might easily start with thousands or millions of items, and use SVD to create a much smaller set of “k” items (e.g. 20 or 70).
Notes/Limitations:
SVD builds features that may or may not map neatly to items (such as movie genres or news topics). As in many areas of machine learning, the lack of explainability can be an issue.
SVD requires that there are no missing values. There are various ways to handle this, including (1) imputation of missing values, (2) mean-centering values around 0, or (3)
Calculating the SVD matrices can be computationally expensive, although calculating ratings once the factorization is completed is very fast. You may need to create a subset of your data for SVD calculations to be successfully performed, especially on a machine with a small RAM footprint.
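As an illustration of options (1) and (2) above, the following is a minimal sketch in base R. The toy matrix and the names `ratings`, `user_means`, and `centered` are made up for this example and are not part of the analysis that follows. It mean-centers each user's observed ratings and then fills the missing cells with 0, which after centering corresponds to that user's own average.

```r
# Toy user-by-item rating matrix; NA marks an item the user has not rated.
# The values are made up for illustration.
ratings <- matrix(c(5, NA, 4,
                    3,  2, NA,
                    NA, 4, 5),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("u1", "u2", "u3"), c("i1", "i2", "i3")))

# Mean-center each user's observed ratings around 0.
user_means <- rowMeans(ratings, na.rm = TRUE)
centered <- sweep(ratings, 1, user_means, "-")

# Impute the missing entries with 0, i.e. the user's own mean after centering.
centered[is.na(centered)] <- 0

# The matrix is now complete, so svd() can be applied.
s <- svd(centered)
```

Predicted ratings obtained from such a factorization would need the user means added back to return to the original 1-5 scale.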
The dataset was retrieved from the data.world website. The data includes user ratings for different Amazon electronic devices on a scale of 1-5, with 5 being the highest rating. We are only interested in the name of the device, the username, and the rating, so we extract only those columns as a start.
Amazon Product Review Dataset
https://data.world/datafiniti/consumer-reviews-of-amazon-products
amz<- read.csv("https://raw.githubusercontent.com/apag101/Data612/master/Projects/Project3/amazon.csv", header = TRUE)
amz<- subset(amz, select=c('name','reviews.rating', 'reviews.username'))
A review of the data shows that there are some NA values. We use complete.cases to remove any rows with NA values.
## Observations: 1,597
## Variables: 3
## $ name <fct> Kindle Paperwhite, Kindle Paperwhite, Kindle Paper...
## $ reviews.rating <int> 5, 5, 4, 5, 5, NA, NA, NA, NA, NA, NA, NA, NA, 4, ...
## $ reviews.username <fct> Cristina M, Ricky, Tedd Gardiner, Dougal, Miljan D...
## Observations: 1,177
## Variables: 3
## $ name <fct> Kindle Paperwhite, Kindle Paperwhite, Kindle Paper...
## $ reviews.rating <int> 5, 5, 4, 5, 5, 4, 5, 5, 4, 5, 4, 4, 5, 5, 5, 4, 5,...
## $ reviews.username <fct> Cristina M, Ricky, Tedd Gardiner, Dougal, Miljan D...
##
## 1 2 3 4 5
## 42 34 124 236 741
With the removal of NA values, the number of rows goes from 1,597 to 1,177.
In the next section, the data is transformed into a matrix with usernames as the rows, device names as the columns, and ratings as the matrix data.
amz.dat <- matrix(data=amz$reviews.rating,nrow=length(unique(amz$reviews.username)),ncol=length(unique(amz$name)))
rownames(amz.dat)<-c(paste(unique(amz$reviews.username)))
colnames(amz.dat)<-c(paste(unique(amz$name)))
glimpse(amz.dat)
## int [1:836, 1:62] 5 5 4 5 5 4 5 5 4 5 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:836] "Cristina M" "Ricky" "Tedd Gardiner" "Dougal" ...
## ..$ : chr [1:62] "Kindle Paperwhite" "Kindle Keyboard" "Certified Refurbished Amazon Fire TV (Previous Generation - 1st)" "Amazon Echo Dot Case (fits Echo Dot 2nd Generation only) - Indigo Fabric" ...
The resulting matrix has 836 rows and 62 columns.
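One point to keep in mind is that `matrix()` fills its cells column by column and recycles the data vector when it is shorter than `nrow * ncol` (here 836 × 62 = 51,832 cells against 1,177 ratings), so a given cell is not guaranteed to hold the rating that the row's user gave to the column's device. A minimal sketch of one alternative is shown below, using `tapply()` on a made-up data frame with the same column names. The data frame `toy`, the result `ui`, and the choice to average duplicate reviews are assumptions for illustration only.

```r
# Toy data frame with the same column names as amz; the rows are made up.
toy <- data.frame(
  name = c("Kindle", "Kindle", "Echo", "Fire TV", "Echo"),
  reviews.username = c("ann", "bob", "ann", "bob", "cat"),
  reviews.rating = c(5, 4, 3, 5, 2),
  stringsAsFactors = FALSE
)

# tapply() groups ratings by (user, item) and averages any duplicates;
# user/item pairs with no review stay NA.
ui <- tapply(toy$reviews.rating,
             list(toy$reviews.username, toy$name),
             mean)

dim(ui)              # 3 users by 3 items
ui["ann", "Echo"]    # the rating ann gave the Echo
ui["cat", "Kindle"]  # NA: cat never rated the Kindle
```

A matrix built this way contains NA for unrated pairs, so it would need imputation or centering before an SVD routine can be applied.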
Singular Value Decomposition (SVD) is a method of dimensionality reduction. It starts with an m x n input matrix A (e.g. m rows and n columns, such as m documents and n terms) and expresses it as the product of 3 matrices: U, D, and V.
The final formula is: \(A = U \Sigma V^T\), where \(\Sigma\) is the diagonal matrix of singular values (called D in this report).
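The factorization can be checked numerically on a small example. The sketch below uses base R's `svd()` on a made-up 4 x 3 matrix (the matrix and the name `A_rebuilt` are illustrative only). Because `svd()` returns the singular values as a plain vector, they are placed on a diagonal with `diag()` before multiplying.

```r
# Small made-up matrix standing in for A.
A <- matrix(c(5, 3, 0,
              4, 0, 1,
              1, 1, 5,
              0, 2, 4),
            nrow = 4, byrow = TRUE)

s <- svd(A)   # s$u, s$d (vector of singular values), s$v

# Rebuild A: the singular values must be placed on a diagonal first.
A_rebuilt <- s$u %*% diag(s$d) %*% t(s$v)

all.equal(A, A_rebuilt)   # TRUE up to floating-point tolerance
```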
In this assignment we use the irlba library, which lets us specify the number of singular values to estimate. In this example we set nv to 30. The output below shows the first 10 D, U, and V values, as well as the number of iterations and the total number of matrix-vector products carried out.
amz.svd<-irlba(t(amz.dat), nv=30, maxit=200)
D values:
## [1] 993.00566 47.58560 39.83519 37.36527 37.03986 36.77521 36.60924
## [8] 36.20103 35.50962 35.16535
U values:
## [1] 0.12990172 0.12490559 0.12613131 0.13076546 0.12528140 0.12350393
## [7] -0.08328865 -0.08634993 0.16311025 -0.02753623
V values, iterations, and matrix-vector products:
## [1] 0.035011953 0.033605630 0.034884376 0.033745851 0.033232679 0.035488864
## [7] 0.005053155 0.023277869 0.012220521 0.040270522
## [1] 12
## [1] 148
The next code multiplies the SVD matrices back together. The first 10 reconstructed values are broadly similar to the first 10 original values.
amz.svdi<-amz.svd$u %*% amz.svd$d %*% t(amz.svd$v[1,])
amz.svdi[1:10]
## [1] 4.020200 3.946102 3.422356 5.159498 3.165943 4.292214 4.882875 2.983146
## [9] 3.701984 4.304987
amz.dat[1:10]
## [1] 5 5 4 5 5 4 5 5 4 5
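For comparison, a full reconstruction multiplies all three factors with the singular values placed on a diagonal, and keeping only the first k of them gives a rank-k approximation. The sketch below illustrates this with base R's `svd()` on a made-up matrix. `rank_k` is a hypothetical helper written for this example and is not part of irlba.

```r
set.seed(1)
# Made-up 6 x 4 "ratings" matrix with integer values from 1 to 5.
A <- matrix(sample(1:5, 24, replace = TRUE), nrow = 6)

s <- svd(A)

# Rank-k approximation from the first k singular values and vectors.
rank_k <- function(s, k) {
  s$u[, 1:k, drop = FALSE] %*%
    diag(s$d[1:k], nrow = k) %*%
    t(s$v[, 1:k, drop = FALSE])
}

# Frobenius-norm error of the approximation for each k.
errs <- sapply(1:4, function(k) norm(A - rank_k(s, k), type = "F"))
errs  # non-increasing; essentially 0 once all 4 singular values are used
```

The error shrinks as k grows, which is the trade-off behind choosing a small number of latent features.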
The next block of code performs the SVD calculations manually. We multiply the original matrix by its transpose and then compute the eigenvectors of the result to derive the V values (irlba was run on the transposed matrix, so its V corresponds to these eigenvectors). The values match the V values from irlba, apart from the sign.
atransv<- amz.dat %*% t(amz.dat)
atransv.e <- eigen(atransv)
head(atransv.e$vectors[1:6])
## [1] -0.03501195 -0.03360563 -0.03488438 -0.03374585 -0.03323268 -0.03548886
head(amz.svd$v[1:6])
## [1] 0.03501195 0.03360563 0.03488438 0.03374585 0.03323268 0.03548886
Here we do the reverse: we multiply the transpose of the original matrix by the original matrix and compute the eigenvectors of the result to derive the U values. The values match the U values from irlba, apart from the sign.
atransu<- t(amz.dat) %*% amz.dat
atransu.e <- eigen(atransu)
head(atransu.e$vectors[1:6])
## [1] -0.1299017 -0.1249056 -0.1261313 -0.1307655 -0.1252814 -0.1235039
head(amz.svd$u[1:6])
## [1] 0.1299017 0.1249056 0.1261313 0.1307655 0.1252814 0.1235039
For the diagonal, we take the square roots of the eigenvalues computed for V and multiply them by the first 3 columns of an identity matrix. This sets all values to 0 except the 3 on the diagonal. The resulting diagonal values are very close to the D values from irlba, though not necessarily a precise match.
r <- sqrt(atransv.e$values)
r <- r * diag(length(r))[,1:3]
r[1:3,]
## [,1] [,2] [,3]
## [1,] 993.0057 0.0000 0.00000
## [2,] 0.0000 47.5856 0.00000
## [3,] 0.0000 0.0000 39.83519
amz.svd$d[1:3]
## [1] 993.00566 47.58560 39.83519
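The relationship used in the manual steps above can be verified on a small example: the eigenvectors of \(A A^T\) and \(A^T A\) agree with the singular vectors up to sign, and the square roots of the eigenvalues agree with the singular values. The sketch below uses base R only, on a made-up matrix. All object names are illustrative.

```r
# Small made-up matrix of rank 3.
A <- matrix(c(5, 3, 0,
              4, 0, 1,
              1, 1, 5,
              0, 2, 4),
            nrow = 4, byrow = TRUE)

s  <- svd(A)
eu <- eigen(A %*% t(A), symmetric = TRUE)   # eigenvectors relate to U
ev <- eigen(t(A) %*% A, symmetric = TRUE)   # eigenvectors relate to V

# Singular values are the square roots of the (non-negative) eigenvalues.
# pmax() guards against tiny negative values caused by rounding.
d_from_eigen <- sqrt(pmax(ev$values, 0))
all.equal(d_from_eigen, s$d)

# Eigenvectors match singular vectors only up to sign, so compare absolute values.
all.equal(abs(ev$vectors), abs(s$v))
all.equal(abs(eu$vectors[, 1:3]), abs(s$u))
```

Comparing absolute values sidesteps the sign flips seen in the outputs above, since an eigenvector and its negative are equally valid.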
Here we multiply the 3 matrices together to get the final values. The results are very close to the original values and to the values calculated from the irlba SVD.
atransm<-atransu.e$vectors[1:3]*-1 %*% r[1:3] %*% t(atransv.e$vectors)[1:3]*-1
atransm[1:3]
## [1] 4.516302 4.342601 4.385216
amz.dat[1:3]
## [1] 5 5 4
amz.svdi[1:3]
## [1] 4.020200 3.946102 3.422356
SVD allows us to find similarities between users and concepts by reducing dimensions. With the new matrices, you can select a user space and find how similar a user is to others, based on preference, by using the cross product of the user's ratings and the ratings for similar devices.
uq<- matrix(c(0),nrow=nrow(amz.dat))
uq[1,1]<-5
sqrt(uq[1:3]%*%atransm)
## [,1]
## [1,] 4.752001
atransm[1]
## [1] 4.516302
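One common way to express user similarity in the reduced space is cosine similarity between users' latent vectors. The sketch below is an illustration under stated assumptions rather than the method used above: the ratings matrix is made up, k = 2 is an arbitrary choice, and `cosine` and `user_latent` are hypothetical names.

```r
# Made-up ratings: rows are users, columns are devices.
A <- matrix(c(5, 5, 1, 1,
              4, 5, 2, 1,
              1, 2, 5, 4,
              1, 1, 4, 5),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("user", 1:4), paste0("dev", 1:4)))

s <- svd(A)
k <- 2  # number of latent dimensions kept (an arbitrary choice here)

# Each row holds one user's coordinates in the k-dimensional latent space.
user_latent <- s$u[, 1:k] %*% diag(s$d[1:k])
rownames(user_latent) <- rownames(A)

# Cosine similarity between two latent vectors.
cosine <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

sim12 <- cosine(user_latent["user1", ], user_latent["user2", ])  # similar tastes
sim14 <- cosine(user_latent["user1", ], user_latent["user4", ])  # different tastes
c(sim12 = sim12, sim14 = sim14)
```

Users who rate the same devices highly end up pointing in similar directions in the latent space, so their similarity is higher than that of users with contrasting ratings.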
https://data.world/datafiniti/consumer-reviews-of-amazon-products
https://rpubs.com/aaronsc32/singular-value-decomposition-r
Code used in analysis
knitr::opts_chunk$set(
echo = FALSE,
message = FALSE,
warning = FALSE
)
#knitr::opts_chunk$set(echo = TRUE)
require(knitr)
library(ggplot2)
library(tidyr)
library(MASS)
library(psych)
library(kableExtra)
library(dplyr)
library(faraway)
library(gridExtra)
library(reshape2)
library(leaps)
library(pROC)
library(caret)
library(naniar)
library(pander)
library(mlbench)
library(e1071)
library(fpp2)
library(mlr)
library(recommenderlab)
library(irlba)
amz<- read.csv("https://raw.githubusercontent.com/apag101/Data612/master/Projects/Project3/amazon.csv", header = TRUE)
amz<- subset(amz, select=c('name','reviews.rating', 'reviews.username'))
namz<-nrow(amz)
glimpse(amz)
amz<-subset(amz, complete.cases(amz))
namz2<-nrow(amz)
glimpse(amz)
table(amz$reviews.rating)
amz.dat <- matrix(data=amz$reviews.rating,nrow=length(unique(amz$reviews.username)),ncol=length(unique(amz$name)))
rownames(amz.dat)<-c(paste(unique(amz$reviews.username)))
colnames(amz.dat)<-c(paste(unique(amz$name)))
glimpse(amz.dat)
amz.svd<-irlba(t(amz.dat), nv=30, maxit=200)
amz.svd$d[1:10]
head(amz.svd$u)[1:10]
head(amz.svd$v)[1:10]
amz.svd$iter
amz.svd$mprod
amz.svdi<-amz.svd$u %*% amz.svd$d %*% t(amz.svd$v[1,])
amz.svdi[1:10]
amz.dat[1:10]
atransv<- amz.dat %*% t(amz.dat)
atransv.e <- eigen(atransv)
head(atransv.e$vectors[1:6])
head(amz.svd$v[1:6])
atransu<- t(amz.dat) %*% amz.dat
atransu.e <- eigen(atransu)
head(atransu.e$vectors[1:6])
head(amz.svd$u[1:6])
r <- sqrt(atransv.e$values)
r <- r * diag(length(r))[,1:3]
r[1:3,]
amz.svd$d[1:3]
atransm<-atransu.e$vectors[1:3]*-1 %*% r[1:3] %*% t(atransv.e$vectors)[1:3]*-1
atransm[1:3]
amz.dat[1:3]
amz.svdi[1:3]
uq<- matrix(c(0),nrow=nrow(amz.dat))
uq[1,1]<-5
sqrt(uq[1:3]%*%atransm)
atransm[1]