INSTRUCTIONS:
The primary objectives of this case are as follows:
- Understand data challenges in developing recommender systems and how to handle them
- Apply different types of recommendation algorithms
- Learn how to validate the models
- Understand the challenges of implementing a real-world analytics solution

Students should be able to answer the following questions:
- What is the difference in recommender system requirements between Bigbasket and other e-commerce companies such as Amazon? Which recommender system is more appropriate for Bigbasket?
- How do you find similarity between products based on what customers buy in different baskets? Which similarities are appropriate in this context, and why?
- How do we build a Smart Basket for a customer?
- What testing strategy should be applied to find out how well the model works?

# Set the working directory and read in the raw transaction data
setwd("/Users/carolyn.khalil/Desktop/R-Tutorial")
RawBasket <- read.csv("R-tutorial/Data/BigBasket.csv")

Load the required libraries:

library(reshape2)
library(recommenderlab)
## Loading required package: Matrix
## Loading required package: arules
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
## Loading required package: proxy
## 
## Attaching package: 'proxy'
## The following object is masked from 'package:Matrix':
## 
##     as.matrix
## The following objects are masked from 'package:stats':
## 
##     as.dist, dist
## The following object is masked from 'package:base':
## 
##     as.matrix
## Loading required package: registry
## Registered S3 methods overwritten by 'registry':
##   method               from 
##   print.registry_field proxy
##   print.registry_entry proxy

- dcast(DATA, formula = ROW_VAR ~ COL_VAR, value.var = "VAL_VAR", fun.aggregate = AGGR_FN)
- This function takes the dataset (DATA) and a formula giving the column that defines the rows (ROW_VAR) and the column that defines the columns (COL_VAR) of the new wide dataset.
- You also need to specify a value column (VAL_VAR) that determines the weight to put in the wide data entries.
- fun.aggregate is applied when there are multiple entries for a cell in the new dataset; example functions include sum or mean. A toy example follows below.
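
As a minimal sketch on made-up data (the names toy, Member, SKU and values are purely hypothetical), dcast() pivots a long table into wide form:

toy <- data.frame(Member = c("A", "A", "B"),
                  SKU    = c("s1", "s2", "s1"),
                  values = c(1, 1, 1))
dcast(toy, formula = Member ~ SKU, value.var = "values", fun.aggregate = sum)
# Expected: one row per Member and one column per SKU, holding sum(values)
# per cell, with 0 (the sum of an empty set) where no purchase exists.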

View the data:

RawBasket[1:6,]
##   Member   Order      SKU       Created.On  Description
## 1 M09736 6468572 34993740 22-09-2014 22:45 Other Sauces
## 2 M09736 6468572 15669800 22-09-2014 22:45      Cashews
## 3 M09736 6468572 34989501 22-09-2014 22:45   Other Dals
## 4 M09736 6468572  7572303 22-09-2014 22:45      Namkeen
## 5 M09736 6468572 15669856 22-09-2014 22:45        Sugar
## 6 M09736 6468572 15668478 22-09-2014 22:45       Banana
RawBasket$values <- 1  # indicator column: each row counts as one purchase
RawBasket[1:6,]
##   Member   Order      SKU       Created.On  Description values
## 1 M09736 6468572 34993740 22-09-2014 22:45 Other Sauces      1
## 2 M09736 6468572 15669800 22-09-2014 22:45      Cashews      1
## 3 M09736 6468572 34989501 22-09-2014 22:45   Other Dals      1
## 4 M09736 6468572  7572303 22-09-2014 22:45      Namkeen      1
## 5 M09736 6468572 15669856 22-09-2014 22:45        Sugar      1
## 6 M09736 6468572 15668478 22-09-2014 22:45       Banana      1
BigBasket <- dcast(RawBasket, formula=Member~SKU,value.var="values",fun.aggregate=sum)
BigBasket[1:7,1:20]
##   Member 6884195 7536640 7537167 7537178 7538018 7538388 7538394 7540256
## 1 M04158       0       0       1       1       0       0       0       2
## 2 M08075       0       0       0       0       0       0       0       0
## 3 M09303       0       0       0       0       0       0       0       0
## 4 M09736       0       0       0       0       0       0       0       0
## 5 M12050       0       0       0       0       0       0       0       0
## 6 M12127       0       0       0       0       0       0       0       0
## 7 M14746       0       0       0       0       0       0       0       0
##   7540257 7540446 7541234 7541236 7541573 7542208 7542424 7543240 7543241
## 1       0       0       0       0       3       1       0       0       0
## 2       0       0       0       0       0       0       0       0       0
## 3       0       0       0       0       0       0       0       0       0
## 4       0       0       0       0       0       0       0       0       0
## 5       0       0       0       0       0       2       0       0       0
## 6       0       0       0       0       0       0       0       0       2
## 7       0       0       0       0       0       0       0       0       1
##   7543244 7543247
## 1       0       0
## 2       0       0
## 3       0       0
## 4       0       0
## 5       0       0
## 6       2       0
## 7       0       1

- The above command puts the sum of RawBasket$values for each SKU/member combination in the BigBasket entries. (To go from a wide dataset back to a long one, use the melt() function from reshape2, sketched below.)
- We want the dataset to be entirely numeric, so remove the Member column and use it as the row names.
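
As an aside, a minimal sketch of the reverse (wide-to-long) operation with melt(), which must be applied while Member is still a column (the name BigBasketLong is hypothetical):

BigBasketLong <- melt(BigBasket, id.vars = "Member",
                      variable.name = "SKU", value.name = "values")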

row.names(BigBasket) <- BigBasket$Member
BigBasket <- BigBasket[,-1]
BigBasket[1:7,1:7]
##        6884195 7536640 7537167 7537178 7538018 7538388 7538394
## M04158       0       0       1       1       0       0       0
## M08075       0       0       0       0       0       0       0
## M09303       0       0       0       0       0       0       0
## M09736       0       0       0       0       0       0       0
## M12050       0       0       0       0       0       0       0
## M12127       0       0       0       0       0       0       0
## M14746       0       0       0       0       0       0       0

2. Calculating similarity (if needed)

# Mean absolute difference between customers 1 and 2
# (Manhattan distance divided by the number of items)
ManhDist <- sum(abs(BigBasket[1,]- BigBasket[2,]))/ncol(BigBasket)
print(ManhDist)
## [1] 0.437067

The Manhattan distance between customers 1 and 2, normalized by the number of items, is ~ 0.437.
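
As a sanity check, the same value can be reproduced with base R's dist(); a sketch (output not shown):

# dist() returns the raw Manhattan distance; divide by the item count as above
as.numeric(dist(BigBasket[1:2, ], method = "manhattan")) / ncol(BigBasket)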

Below is the cosine similarity: sim(a, b) = sum(a*b) / sqrt(sum(a^2) * sum(b^2)).

cosim <- function(Customer1, Customer2){
  # treat unpurchased (missing) items as zero
  Customer1[is.na(Customer1)] <- 0
  Customer2[is.na(Customer2)] <- 0
  cust12 <- Customer1 * Customer2  # elementwise a*b
  cust11 <- Customer1 * Customer1  # a^2
  cust22 <- Customer2 * Customer2  # b^2
  
  sim <- sum(cust12)/sqrt(sum(cust11)*sum(cust22))
  
  return(sim)
}
cosim(BigBasket[1,],BigBasket[5,])
## [1] 0.2323311

The cosine similarity between customers 1 and 5 is 23.23%. With nonnegative purchase counts this measure lies between 0 and 1, so 0.23 indicates a modest overlap.

Mcosim <- function(Customer1, Customer2){
  # same computation as cosim(), written with matrix products
  cust1 <- as.matrix(Customer1)
  cust2 <- as.matrix(Customer2)
  cust1[is.na(cust1)] <- 0
  cust2[is.na(cust2)] <- 0
  
  sim <- cust1 %*% t(cust2) / sqrt(cust1 %*% t(cust1) * cust2 %*% t(cust2))
  return(sim)
}
Mcosim(BigBasket[1,],BigBasket[5,])
##           M12050
## M04158 0.2323311

Adjusted cosine similarity

Adjcosim <- function(Customer1, Customer2){
  Customer1[is.na(Customer1)] <- 0
  Customer2[is.na(Customer2)] <- 0
  # numerator: ratings centered by each customer's mean
  cust12 <- (Customer1-rowMeans(Customer1)) * (Customer2-rowMeans(Customer2))
  # denominator: note it uses the raw, uncentered ratings
  cust11 <- Customer1 * Customer1
  cust22 <- Customer2 * Customer2
  
  sim <- sum(cust12)/sqrt(sum(cust11)*sum(cust22))
  
  return(sim)
}
Adjcosim(BigBasket[1,],BigBasket[5,])
## [1] 0.2009851

The adjusted cosine similarity between customers 1 and 5 is ~ 20.1%.
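
Note that Adjcosim() above centers only the numerator. The textbook adjusted cosine centers the denominator as well, which makes it the Pearson correlation between the two customers. A minimal sketch of that variant (its result will differ from the 0.2009851 above):

AdjcosimCentered <- function(Customer1, Customer2){
  Customer1[is.na(Customer1)] <- 0
  Customer2[is.na(Customer2)] <- 0
  c1 <- Customer1 - rowMeans(Customer1)  # center customer 1's ratings
  c2 <- Customer2 - rowMeans(Customer2)  # center customer 2's ratings
  sum(c1 * c2) / sqrt(sum(c1 * c1) * sum(c2 * c2))
}
AdjcosimCentered(BigBasket[1,], BigBasket[5,])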

3. Building the recommender

When building a recommendation system we want to predict missing data, so let's use NA instead of 0. recommenderlab's sparse matrix representation does not store NAs (missing entries) but does store 0s (for products with a zero rating), so we recode the 0s as NA.

BigBasket[BigBasket==0] <- NA  # never-purchased items become missing rather than zero


dim(BigBasket)
## [1]  106 1732
BigBasket <- as.matrix(BigBasket)
R <- as(BigBasket,"realRatingMatrix")
R
## 106 x 1732 rating matrix of class 'realRatingMatrix' with 15485 ratings.
# see the users' ratings
getRatingMatrix(R)
## 106 x 1732 sparse Matrix of class "dgCMatrix"
##    [[ suppressing 33 column names '6884195', '7536640', '7537167' ... ]]
##                                                                             
## M04158  . . 1 1 . . . 2 . . . .  3 1 . .  . .  . . . . . . . . . . . . . . .
## M08075  . . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M09303  . . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M09736  . . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M12050  . . . . . . . . . . . .  . 2 . .  . .  . . . . . . . . . 1 . . . . .
## M12127  . . . . . . . . . . . .  . . . .  2 2  . . . . . . . . . . . . . . .
## M14746  . . . . . . . . . . . .  . . . .  1 .  1 . . . . . . . . . . . . . .
## M16218  . . . 1 . . . . . . . .  1 . . .  . .  . . . . . . . . . . . . . . .
## M16611  . . . . . . . . . . . .  . 1 . .  . .  . . . . . . . . . . . . . . .
## M18732  . 1 . . . . . . . . . .  . . . .  1 .  . . . . . . . . . . . . . . .
## M22037  . . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . 1 . .
## M25900  . 1 . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M27458  1 . . . 2 . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . 2
## M27871  1 . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M31101  . . . . 1 . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M31908  . . . . . . . . . . . .  . . . .  . .  . . . . . 1 . . . . . . . . .
## M31966  . . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . 1 . 1
## M32039  . . . . . . . . . . . .  2 . . .  . .  . . . . . . . . . . . . . . .
## M32409  . . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M32449  . . . . . . . . . . . .  4 . . .  . .  . . . . . . . . . . . . . . .
## M32480  . . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M32655  . . . . . . . . . . . .  1 . . .  . .  . . . . . . . . . . . . . . .
## M33064  . . . . . . . . . . 1 .  . . . .  . .  . . . . . . . . . . . . . . .
## M33422  . . 1 1 . . 1 . . . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M33491  . . 2 . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M33558  . . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M33745  . . 1 . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M33767  . . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M34566  . . . . . . . . . . . .  1 . . .  . .  . . . . . . . . . . . . . . .
## M35070  2 . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M35464  . . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M35538  . . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M35649  . . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M36366  . . . . . . . . 1 . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M36432  1 . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . 2 . .
## M36702  . . . . . . . . . . . .  . . . .  1 .  . . 1 . . . . . . . . . . . .
## M36876  2 . . . . . . . . . . .  1 . . .  . .  . . . . . . . . . . . . . . .
## M37253 11 . . . . . . . . . . .  1 . . .  . .  . . . . . . . . . . . . . . 1
## M37600  . . . . . . . . . . . .  . . . .  8 .  . . . . . . . . . . . . . . 2
## M38622  . . . . . . . . . . . .  5 . . 6 16 . 10 . . . 3 . . . . . . . . . .
## M39021  . . . . . . . . . . . .  1 . . .  . .  . . . . . . . . . . . . . . .
## M40184  . . . . . . . . . . . .  3 . . .  . .  . . . . . . . . . . . . . . 1
## M41700  . . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M41747  . . . . . . . . . . . .  . . . . 22 .  . . . . . . . . . . . . . . .
## M41781  1 . . . . . . 1 . . . .  1 . . .  3 .  . . . . . 1 . . . . . . 1 . .
## M42182  1 . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M42513  . . . . . . . . . 1 . .  . . . .  . .  . . . . . . . . . . . . . . .
## M42827  . . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M43189  . . . . . . . . . . . .  3 . . .  . .  . . . . . . . 1 . . . . . . .
## M43831  . . . . 1 . . . . . . .  6 1 . .  . .  . . . . . . . . . . . . . . .
## M43977  . . . . . . . . 3 . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M44156  . . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M45375  . . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M45470  . . . . . . . . 1 . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M46325  . . 1 1 . . . . . . . .  3 . . .  . .  . . . . . . . . . . . . . . .
## M46328  . . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . 1 . . . .
## M46575  . . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M46687  . . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M47229  . . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M48101  . . . . . . . . . . 1 .  . . . .  . .  . . . . . . . . . . . . . . .
## M48154  . . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M48938  . . . . . . . . . . . .  . . . .  . .  . . . . 1 . . . . . . . . . .
## M50038  . . . . . . . . . . . .  . . . .  . .  . . . . . . 1 . . . . . . . .
## M50094  . . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . 2
## M50420  . . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M50767  . . . . . . . . . . . . 25 . . .  . .  . . . . . . . . . . . . . . .
## M51043  . . . . . . . . . . . .  2 . . .  . .  . . . . . . . . . . . . . . .
## M51278  . . . . 4 . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . 2
## M52629  . . . . . . . . . . . .  . . . .  . .  . . . 1 . . . . . . . . . . .
## M54100  . . . . . . . . . . . .  1 . . .  . .  . . . . . . . . . . . . . . .
## M54345  . . . . . . . 1 1 . . .  . 2 . .  . .  . . . . . . . . . . . . . . 1
## M54382  . . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . 2
## M54619  . . . . . . . . . . . .  3 . . .  . .  . . . . . . . . . . . . . . .
## M54796  . . . . . . . . . . . .  1 . . .  . .  . . . . . . . . . . . . . . .
## M55932  . . . . . . . . . . . 2  2 . . .  1 .  . . . . . . . . . . . . . . 1
## M56255  . . . . . . . . . . . .  1 . . .  . .  . . . . . . . . . . . . . . .
## M56309  . . . . . . . . . . . .  . . . .  . .  . . . . . . 5 . . . . . . . .
## M56368  . . . . . . . . . . . .  . . . .  . .  . . . . . . . . 1 . . . . . .
## M56489  . . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M56516  4 . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . 1 . . .
## M56897  . . . . . . . . . . . .  . . 1 .  . .  . . . . . . . . . . . . . . .
## M57093  . . . . . . . . . . . .  1 . . .  . .  . . . . . . . . . . . . . . 1
## M57327  . . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . . 1 .
## M57354  . . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M58761  1 . . . . . . 1 . . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M58939  . . . . . 1 1 . . . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M59012  . . . . . . . . . . . .  . . . .  1 .  . . . . 2 . . . . . . . . . .
## M59232  . . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . 2 . .
## M62656  . . . . . . . . . . . .  . . . .  . .  . . . . . . . . 1 . . . . . .
## M62833  . . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M63404  . . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M64055  . . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . 1
## M64379  . . . . . . . . 1 . . .  . . . .  . .  . 1 . . . . . . . . . . . . .
## M76390  1 . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M77779  2 . . . . . . . . . . .  8 . . .  . .  . . . . . . . . . . . . . . .
## M78365  . . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M78720  1 . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M82651  . . . . . . . . . . . .  . 1 . . 10 .  1 . . . . . . . . . . . . . .
## M84827  . . . . . . . . . . . .  . . . 2  2 .  . 4 . . . . . . . . . . . . .
## M86304  . . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M86572  . . . . . . . . . . . .  1 . . .  . .  . . . . . . . . . . . . . . 1
## M90375  . . . . . . . . . . . .  . 1 . .  . .  . . . . . . . . . . . . . . .
## M91098  . . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M96365  . . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M99030  . . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . .
## M99206  . . . . . . . . . . . .  . . . .  . .  . . . . . . . . . . . . . . .
## 
##  .....suppressing 1699 columns in show(); maybe adjust 'options(max.print= *, width = *)'
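
Before setting up an evaluation scheme it helps to gauge how sparse the data are; a quick sketch using recommenderlab's rowCounts()/colCounts() helpers (output not shown):

summary(rowCounts(R))  # number of SKUs each member has purchased
summary(colCounts(R))  # number of members who bought each SKU
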
# 80/20 train/test split: 15 items per test member are 'given',
# and ratings >= 5 count as good
e <- evaluationScheme(R, method="split", train=0.8, given=15, goodRating=5)
e
## Evaluation scheme with 15 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.800
## Good ratings: >=5.000000
## Data set: 106 x 1732 rating matrix of class 'realRatingMatrix' with 15485 ratings.
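
The scheme exposes its three partitions through getData(); a quick sketch to inspect their dimensions (output not shown):

dim(getData(e, "train"))    # the 80% of members used to fit the models
dim(getData(e, "known"))    # test members with only the 15 'given' items revealed
dim(getData(e, "unknown"))  # the same test members' held-out items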

Train UBCF (user-based collaborative filtering) models with cosine similarity.

Non-normalized:

UBCF_NoNorm_Cos <- Recommender(getData(e, "train"), "UBCF", 
                               param=list(normalize = NULL, method="Cosine"))

Centered:

UBCF_Center_Cos <- Recommender(getData(e, "train"), "UBCF", 
                               param=list(normalize = "center",method="Cosine"))

Z-score normalization:

UBCF_Zscore_Cos <- Recommender(getData(e, "train"), "UBCF", 
                               param=list(normalize = "Z-score",method="Cosine"))

Predict ratings for the known part of the test data, then measure accuracy against the unknown (held-out) part:

p1 <- predict(UBCF_NoNorm_Cos, getData(e, "known"), type="ratings")
p2 <- predict(UBCF_Center_Cos, getData(e, "known"), type="ratings")
p3 <- predict(UBCF_Zscore_Cos, getData(e, "known"), type="ratings")
error_UCOS <- rbind(
  UBCF_NoNorm_Cos = calcPredictionAccuracy(p1, getData(e, "unknown")),
  UBCF_Center_Cos = calcPredictionAccuracy(p2, getData(e, "unknown")),
  UBCF_Zscore_Cos = calcPredictionAccuracy(p3, getData(e, "unknown"))
)
knitr::kable(error_UCOS, "markdown")
|                |     RMSE|      MSE|      MAE|
|:---------------|--------:|--------:|--------:|
|UBCF_NoNorm_Cos | 5.676082| 32.21790| 3.002829|
|UBCF_Center_Cos | 5.420516| 29.38199| 3.465806|
|UBCF_Zscore_Cos | 5.557727| 30.88832| 3.478942|

What can we conclude from the MAE (mean absolute error) and MSE (mean squared error)? The MAE is the mean of the absolute values of the prediction errors, i.e., the average distance between the predicted and the actual values for the test sample.

The MSE also uses absolute deviations (it doesn't matter whether the value was under- or over-predicted by 2), but because the errors are squared, "outliers" (larger errors) are penalized more heavily.

From these UBCF models we can see that the non-normalized model had the lowest MAE (~ 3.00), meaning the least average deviation from the actual values. However, since its MSE is large relative to an MAE of 3.00, it must have had some values that were badly under- or over-estimated. A toy example of this effect follows below.
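
A minimal sketch with hypothetical errors shows how squaring lets a single outlier dominate:

errors <- c(1, 1, 1, 5)  # hypothetical prediction errors
mean(abs(errors))        # MAE = 2
mean(errors^2)           # MSE = 7: the one large error dominates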

hist(as.vector(as(p3, "matrix")), main = "Distrib. of Predicted Values for UBCF_Zscore_Cos Model", col = "yellow", xlab = "Predicted Ratings")

A histogram of the Z-score model’s predicted values demonstrates that their distribution is nearly normal.

Evaluation of top-N recommender algorithms

We need to know how many of the recommended items (e.g., n = 10) were correctly recommended, i.e., had a goodRating (actual sales) among the masked items in the test set.

# 4-fold cross-validation; given = -5 means "all-but-5": every item except
# 5 per test member is given, and the 5 withheld items must be predicted
ml.cross <- evaluationScheme(R, method = "cross", k = 4, given = -5, goodRating = 5)
ml.cross
## Evaluation scheme using all-but-5 items
## Method: 'cross-validation' with 4 run(s).
## Good ratings: >=5.000000
## Data set: 106 x 1732 rating matrix of class 'realRatingMatrix' with 15485 ratings.
algorithms <- list(
  "UBCF_NoNorm_Cos" = list(name="UBCF", param=list(normalize = NULL, method="Cosine",nn=50)),
  "UBCF_Center_Cos" = list(name="UBCF", param=list(normalize = "center", method="Cosine",nn=50)),
  "UBCF_Zscore_Cos" = list(name="UBCF", param=list(normalize = "Z-score", method="Cosine",nn=50))
)

ml.cross.results <- evaluate(ml.cross, algorithms, type = "topNList", n = c(1, 3, 5, 10, 15, 20))
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.09sec] 
##   2  [0sec/0.11sec] 
##   3  [0sec/0.08sec] 
##   4  [0sec/0.1sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.06sec] 
##   2  [0sec/0.08sec] 
##   3  [0sec/0.06sec] 
##   4  [0sec/0.22sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.08sec] 
##   2  [0sec/0.07sec] 
##   3  [0sec/0.08sec] 
##   4  [0sec/0.07sec]
ml.cross.results
## List of evaluation results for 3 recommenders:
## Evaluation results for 4 folds/samples using method 'UBCF'.
## Evaluation results for 4 folds/samples using method 'UBCF'.
## Evaluation results for 4 folds/samples using method 'UBCF'.
names(ml.cross.results)
## [1] "UBCF_NoNorm_Cos" "UBCF_Center_Cos" "UBCF_Zscore_Cos"
ml.cross.results[["UBCF_NoNorm_Cos"]]
## Evaluation results for 4 folds/samples using method 'UBCF'.
getConfusionMatrix(ml.cross.results[["UBCF_NoNorm_Cos"]])[[1]]
##            TP         FP        FN       TN  precision     recall        TPR
## 1  0.07142857  0.9285714 1.1428571 1734.857 0.07142857 0.03787879 0.03787879
## 3  0.14285714  2.8571429 1.0714286 1732.929 0.04761905 0.10606061 0.10606061
## 5  0.25000000  4.7500000 0.9642857 1731.036 0.05000000 0.21969697 0.21969697
## 10 0.32142857  9.6785714 0.8928571 1726.107 0.03214286 0.28030303 0.28030303
## 15 0.42857143 14.5714286 0.7857143 1721.214 0.02857143 0.38636364 0.38636364
## 20 0.42857143 19.5714286 0.7857143 1716.214 0.02142857 0.38636364 0.38636364
##             FPR
## 1  0.0005349271
## 3  0.0016459861
## 5  0.0027364723
## 10 0.0055758439
## 15 0.0083946427
## 20 0.0112751835
plot(ml.cross.results, annotate=c(1,3), legend="topleft")

ml.cross.results[["UBCF_Center_Cos"]]
## Evaluation results for 4 folds/samples using method 'UBCF'.
getConfusionMatrix(ml.cross.results[["UBCF_Center_Cos"]])[[1]]
##            TP         FP        FN       TN  precision     recall        TPR
## 1  0.07142857  0.9285714 1.1428571 1734.857 0.07142857 0.03787879 0.03787879
## 3  0.14285714  2.8571429 1.0714286 1732.929 0.04761905 0.10606061 0.10606061
## 5  0.21428571  4.7857143 1.0000000 1731.000 0.04285714 0.17424242 0.17424242
## 10 0.28571429  9.7142857 0.9285714 1726.071 0.02857143 0.23484848 0.23484848
## 15 0.35714286 14.6428571 0.8571429 1721.143 0.02380952 0.32575758 0.32575758
## 20 0.39285714 19.6071429 0.8214286 1716.179 0.01964286 0.33712121 0.33712121
##             FPR
## 1  0.0005349271
## 3  0.0016459861
## 5  0.0027570450
## 10 0.0055964166
## 15 0.0084358119
## 20 0.0112957444
ml.cross.results[["UBCF_Zscore_Cos"]]
## Evaluation results for 4 folds/samples using method 'UBCF'.
getConfusionMatrix(ml.cross.results[["UBCF_Zscore_Cos"]])[[1]]
##            TP         FP        FN       TN  precision     recall        TPR
## 1  0.07142857  0.9285714 1.1428571 1734.857 0.07142857 0.03787879 0.03787879
## 3  0.14285714  2.8571429 1.0714286 1732.929 0.04761905 0.10606061 0.10606061
## 5  0.21428571  4.7857143 1.0000000 1731.000 0.04285714 0.17424242 0.17424242
## 10 0.25000000  9.7500000 0.9642857 1726.036 0.02500000 0.21969697 0.21969697
## 15 0.32142857 14.6785714 0.8928571 1721.107 0.02142857 0.28030303 0.28030303
## 20 0.39285714 19.6071429 0.8214286 1716.179 0.01964286 0.33712121 0.33712121
##             FPR
## 1  0.0005349271
## 3  0.0016459861
## 5  0.0027570450
## 10 0.0056170131
## 15 0.0084563847
## 20 0.0112957444

What can we conclude about the different algorithms from TP and precision? Are the results very different from what you would have expected from MAE/MSE? Abbreviations: TP = true positive, FP = false positive, FN = false negative, TN = true negative, TPR = true positive rate, FPR = false positive rate.

Precision is the fraction of relevant instances among the retrieved instances: precision = TP / (TP + FP). Recall is the fraction of all relevant instances that were actually retrieved: recall = TP / (TP + FN). Both depend on an understanding and measure of relevance. (recommenderlab reports counts averaged across test users, so ratios recomputed from the averaged counts need not match the reported recall exactly.)

Looking at these results, we can again see that the no-normalization model has the highest (or tied) TP, TN, precision and recall values and the lowest FP and FN values in 5 of the 6 cutoffs above.

However, looking at the proportion of TP to FP (false positives outnumber true positives at every cutoff, so precision never exceeds about 7%), these results may not reflect a mathematically sound model. Further models should be explored.
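
Note that the confusion matrices above come from the first fold only ([[1]]). recommenderlab's avg() averages the results over all 4 folds; a sketch (output not shown):

avg(ml.cross.results[["UBCF_NoNorm_Cos"]])  # fold-averaged confusion matrix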

Using a 0-1 data set

For comparison, we will check how the algorithms fare given less information. We convert the data set into 0-1 data and, instead of the all-but-5 scheme, we use a given-3 scheme. Note that for binary data UBCF ignores the normalize parameter (hence the repeated warnings below) and defaults to Jaccard similarity.

R_binary <- binarize(R, minRating=5)  # entries with value >= 5 become 1; all others 0
R_binary
## 106 x 1732 rating matrix of class 'binaryRatingMatrix' with 3550 ratings.
R
## 106 x 1732 rating matrix of class 'realRatingMatrix' with 15485 ratings.
ml.cross_binary <- evaluationScheme(R_binary, method = "cross", k = 4, given = 3)
ml.cross_binary
## Evaluation scheme with 3 items given
## Method: 'cross-validation' with 4 run(s).
## Good ratings: NA
## Data set: 106 x 1732 rating matrix of class 'binaryRatingMatrix' with 3550 ratings.
ml.cross_binary.results <- evaluate(ml.cross_binary, algorithms, type = "topNList", n = c(1, 3, 5, 10, 15, 20))
## UBCF run fold/sample [model time/prediction time]
##   1
## Warning: Unknown parameters: normalize
## Available parameter (with default values):
## method    =  jaccard
## nn    =  25
## weighted  =  TRUE
## sample    =  FALSE
## verbose   =  FALSE
## [0sec/0.15sec] 
##   2
## Warning: Unknown parameters: normalize
## Available parameter (with default values):
## method    =  jaccard
## nn    =  25
## weighted  =  TRUE
## sample    =  FALSE
## verbose   =  FALSE
## [0sec/0.33sec] 
##   3
## Warning: Unknown parameters: normalize
## Available parameter (with default values):
## method    =  jaccard
## nn    =  25
## weighted  =  TRUE
## sample    =  FALSE
## verbose   =  FALSE
## [0sec/0.14sec] 
##   4
## Warning: Unknown parameters: normalize
## Available parameter (with default values):
## method    =  jaccard
## nn    =  25
## weighted  =  TRUE
## sample    =  FALSE
## verbose   =  FALSE
## [0sec/0.14sec] 
## UBCF run fold/sample [model time/prediction time]
##   1
## Warning: Unknown parameters: normalize
## Available parameter (with default values):
## method    =  jaccard
## nn    =  25
## weighted  =  TRUE
## sample    =  FALSE
## verbose   =  FALSE
## [0sec/0.15sec] 
##   2
## Warning: Unknown parameters: normalize
## Available parameter (with default values):
## method    =  jaccard
## nn    =  25
## weighted  =  TRUE
## sample    =  FALSE
## verbose   =  FALSE
## [0sec/0.14sec] 
##   3
## Warning: Unknown parameters: normalize
## Available parameter (with default values):
## method    =  jaccard
## nn    =  25
## weighted  =  TRUE
## sample    =  FALSE
## verbose   =  FALSE
## [0sec/0.14sec] 
##   4
## Warning: Unknown parameters: normalize
## Available parameter (with default values):
## method    =  jaccard
## nn    =  25
## weighted  =  TRUE
## sample    =  FALSE
## verbose   =  FALSE
## [0sec/0.28sec] 
## UBCF run fold/sample [model time/prediction time]
##   1
## Warning: Unknown parameters: normalize
## Available parameter (with default values):
## method    =  jaccard
## nn    =  25
## weighted  =  TRUE
## sample    =  FALSE
## verbose   =  FALSE
## [0sec/0.14sec] 
##   2
## Warning: Unknown parameters: normalize
## Available parameter (with default values):
## method    =  jaccard
## nn    =  25
## weighted  =  TRUE
## sample    =  FALSE
## verbose   =  FALSE
## [0sec/0.15sec] 
##   3
## Warning: Unknown parameters: normalize
## Available parameter (with default values):
## method    =  jaccard
## nn    =  25
## weighted  =  TRUE
## sample    =  FALSE
## verbose   =  FALSE
## [0sec/0.28sec] 
##   4
## Warning: Unknown parameters: normalize
## Available parameter (with default values):
## method    =  jaccard
## nn    =  25
## weighted  =  TRUE
## sample    =  FALSE
## verbose   =  FALSE
## [0sec/0.14sec]
ml.cross_binary.results
## List of evaluation results for 3 recommenders:
## Evaluation results for 4 folds/samples using method 'UBCF'.
## Evaluation results for 4 folds/samples using method 'UBCF'.
## Evaluation results for 4 folds/samples using method 'UBCF'.
names(ml.cross_binary.results)
## [1] "UBCF_NoNorm_Cos" "UBCF_Center_Cos" "UBCF_Zscore_Cos"
ml.cross_binary.results[["UBCF_NoNorm_Cos"]]
## Evaluation results for 4 folds/samples using method 'UBCF'.
getConfusionMatrix(ml.cross_binary.results[["UBCF_NoNorm_Cos"]])[[1]]
##           TP         FP       FN       TN precision     recall        TPR
## 1  0.8928571  0.1071429 30.03571 1697.964 0.8928571 0.03132528 0.03132528
## 3  2.5357143  0.4642857 28.39286 1697.607 0.8452381 0.08930276 0.08930276
## 5  3.6428571  1.3571429 27.28571 1696.714 0.7285714 0.12654320 0.12654320
## 10 5.6785714  4.3214286 25.25000 1693.750 0.5678571 0.19755992 0.19755992
## 15 7.2142857  7.7857143 23.71429 1690.286 0.4809524 0.25108609 0.25108609
## 20 8.7500000 11.2500000 22.17857 1686.821 0.4375000 0.30059833 0.30059833
##             FPR
## 1  6.281596e-05
## 3  2.729805e-04
## 5  7.979169e-04
## 10 2.543353e-03
## 15 4.583675e-03
## 20 6.622136e-03
plot(ml.cross_binary.results, annotate=c(1,3), legend="topleft")

An echo=FALSE option was added to the code chunk to prevent printing of the R code that generated the plot.
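
A complementary view plots precision against recall for the binary results; a sketch using the same results object:

plot(ml.cross_binary.results, "prec/rec", annotate=c(1,3), legend="bottomright")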

Clear environment

rm(list = ls())

Clear packages

detach("package:datasets", unload = TRUE) # For base

Clear plots

dev.off() # But only if there IS a plot

Clear console

cat("\014") # ctrl+L

Clear mind :)