Overview

Both authors, Jaan and Kai, have worked with MovieLens data before, though not as a team: Kai used Python while Jaan used R. As Kai was transitioning from Python, the built-in MovieLens data provided an excellent way to become acquainted with R for the final project.

The main reason we chose this setup is that the dataset we plan to use for our final project may be too large for recommenderlab; we may seek additional libraries that leverage distributed computing.

Deliverable 1

Compare accuracy of two recommender systems.

Loading Libraries:

library(tidyverse)
library(recommenderlab)
library(reshape2)
library(data.table) # Faster table operations

Data

We are going to use the rating data contained in the built-in MovieLense set.

data(MovieLense)
image(MovieLense, main = "Heatmap of the rating matrix")

Here we can see that the built-in MovieLense data is very sparse, with many more movies than users.
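
A quick way to quantify that sparsity (a minimal sketch using standard recommenderlab accessors):

# Fraction of user-movie cells that actually hold a rating
n_ratings <- length(getRatings(MovieLense))
n_cells   <- prod(dim(MovieLense))
n_ratings / n_cells   # about 0.06, i.e. roughly 94% of cells are empty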

Explore

First we split the data using the evaluationScheme method.

# Use eval scheme to split the data 70/30 Train/Test
eval_create <- proc.time()
mM <- evaluationScheme(MovieLense,
                       method = "split", 
                       train = 0.7, 
                       given = -5, # all but 5 items given per user
                       goodRating = 5)
eval_create <- proc.time() - eval_create

As shown in the figure below, the similarity between users is denoted by the darkness of the cell color (white to yellow to orange to red). We see that users share varying amounts of similarity depending on the pairing, according to the Pearson method.

# Review the similarity between the users
s_users <- similarity(mM@data[1:20,],
                      method = "pearson", 
                      which = "users")
image(as.matrix(s_users), main = "User Similarity, First 20 Users")
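
For a small numeric slice of the same similarity matrix, rounded for readability:

round(as.matrix(s_users)[1:5, 1:5], 2)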

Taking another look at the data, we review four histograms (see below) and provide some observations:

  • Top-left: Most users rate about 50 movies
  • Top-right: There appears to be some positive bias in the movie reviews
  • Bottom row: Both plots support that there is some positive bias in the ratings

par(mfrow=c(2,2))
hist(rowCounts(mM@data), breaks=50, main = "Ratings per User")
hist(colMeans(mM@data), breaks=50, main = "Mean Rating per Movie")
hist(getRatings(mM@data), breaks=100, main = "Hist of Ratings, Original")
hist(getRatings(normalize(mM@data)), breaks=100, 
     main = "Hist of Ratings, Normalized")

Predictions

Below, we create 4 recommenders: IBCF, UBCF, SVD, and POPULAR.

# Build Recommenders to Compare
rm_ibcf <- Recommender(data = getData(mM, "train"), 
                       method = "IBCF")
rm_ubcf <- Recommender(data = getData(mM, "train"),
                       method = "UBCF")
rm_svd  <- Recommender(data = getData(mM, "train"), 
                       method = "SVD")
rm_popu <- Recommender(data = getData(mM, "train"),
                       method = "POPULAR")

# rm_ubcf2 <- Recommender(data = getData(mM, "train"),
#                         method = "UBCF",
#                         parameter = list(method = "pearson", nn = 20))
# rm_ubcf2@model$data@data

In the recommenderlab package, predict() can return either top-N lists or predicted ratings for all user-item combinations; passing type = "ratings" requests the latter.

p1 <- predict(rm_ibcf, getData(mM, "known"), type = "ratings")
p2 <- predict(rm_ubcf, getData(mM, "known"), type = "ratings")
p3 <- predict(rm_svd, getData(mM, "known"), type = "ratings")
p4 <- predict(rm_popu, getData(mM, "known"), type = "ratings")
# mM@data                 # 943 x 1664 rating matrix of class 
#                         #‘realRatingMatrix’ with 99392 ratings.
# getData(mM, "known")  # 283 x 1664 rating matrix of class 
#                         #‘realRatingMatrix’ with 849 ratings.
# getData(mM, "unknown")# 283 x 1664 rating matrix of class 
#                         #‘realRatingMatrix’ with 30758 ratings.
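
The same recommenders can also return top-N lists directly; a minimal sketch using the SVD model:

# Top-5 titles per test user from the SVD model
top5 <- predict(rm_svd, getData(mM, "known"), type = "topNList", n = 5)
as(top5, "list")[1:2]   # inspect the lists for the first two test users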

RMSE Baseline Evaluation

err_eval <- rbind(
  IBCF = calcPredictionAccuracy(p1, getData(mM, "unknown")),
  UBCF = calcPredictionAccuracy(p2, getData(mM, "unknown")),
  SVD  = calcPredictionAccuracy(p3, getData(mM, "unknown")),
  POPU = calcPredictionAccuracy(p4, getData(mM, "unknown")))
err_eval
##           RMSE       MSE       MAE
## IBCF 1.4538081 2.1135579 1.1236407
## UBCF 1.0616733 1.1271503 0.8564724
## SVD  1.0398814 1.0813533 0.8348754
## POPU 0.9989997 0.9980005 0.7913465

UBCF, SVD, and POPULAR are very close in their error values, while IBCF trails behind. Further, an RMSE of \(\approx 1.04\) for SVD isn’t bad considering that we didn’t normalize the ratings before building.
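
For reference, the error measures reported by calcPredictionAccuracy over the held-out set \(T\) are the standard definitions:

\[
\mathrm{RMSE} = \sqrt{\frac{1}{|T|}\sum_{(u,i) \in T}\left(\hat{r}_{ui} - r_{ui}\right)^2},
\qquad
\mathrm{MAE} = \frac{1}{|T|}\sum_{(u,i) \in T}\left|\hat{r}_{ui} - r_{ui}\right|
\]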

Prediction-Evaluation: Top-N

For baseline evaluation of top-N recommendations, we evaluate UBCF, POPULAR, and SVD over five repeated train/test splits.

recs_qtys <- c(1,3,5,7,9,11,13,15)

# Use eval scheme for repeated-split validation
cM <- evaluationScheme(MovieLense,
                       method = "split", 
                       train = .7, 
                       k = 5,           # 5 repeated 70/30 splits
                       given = -5,      # all but 5 items given per user
                       goodRating = 5)
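
We then run UBCF, POPULAR, and SVD over this scheme, requesting top-N lists of the sizes defined above. A minimal sketch of the evaluate() call, consistent with the run log that follows:

algos <- list(
  UBCF    = list(name = "UBCF", param = NULL),
  POPULAR = list(name = "POPULAR", param = NULL),
  SVD     = list(name = "SVD", param = NULL))
rr <- evaluate(cM, algos, type = "topNList", n = recs_qtys)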
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/3.96sec] 
##   2  [0.01sec/3.89sec] 
##   3  [0sec/3.87sec] 
##   4  [0.02sec/3.66sec] 
##   5  [0.02sec/3.57sec] 
## POPULAR run fold/sample [model time/prediction time]
##   1  [0.01sec/1.64sec] 
##   2  [0.01sec/1.69sec] 
##   3  [0sec/1.7sec] 
##   4  [0sec/1.63sec] 
##   5  [0.02sec/1.56sec] 
## SVD run fold/sample [model time/prediction time]
##   1  [0.73sec/0.43sec] 
##   2  [0.73sec/0.68sec] 
##   3  [0.72sec/0.41sec] 
##   4  [0.82sec/0.42sec] 
##   5  [0.73sec/0.42sec]
## List of evaluation results for 3 recommenders:
## Evaluation results for 5 folds/samples using method 'UBCF'.
## Evaluation results for 5 folds/samples using method 'POPULAR'.
## Evaluation results for 5 folds/samples using method 'SVD'.

Below, we plot the ROC curves and the precision/recall curves for the three recommenders. Units of measure along the TPR (true-positive rate) and FPR (false-positive rate) axes are very small; one can see that all three produce similar performance for top-5 recommendations.

par(mfrow=c(1,2))
plot(rr, legend = "topleft")
plot(rr, "prec/rec", legend="topright")

Deliverable 2

Implement support for at least one business or user experience goal such as increased serendipity, novelty, or diversity.

As previous students have mentioned in our presentations, there can be more to a business model than straight accuracy.

It can be delightful to find a movie on Netflix that is a bit outside your regular taste but still surprises and excites. Generally, this involves taking some risk on the predicted rating.

We believe that users are naturally curious and do not always follow recommendations. They instead go to a general list of movies and pick one, not at random, but still one that the recommender might think would not appeal to them. This introduces natural experiments where a user decides if they like a film.

Now, there are many methods to achieve this, and any answer for which one is better would have to be validated with A/B testing on various KPIs relevant to the business. (More on this later.)

We believe that finding new movies that are less commonly rated than the ones usually recommended to users is a relevant business idea. To achieve this, we took the top 20 most similar users, based on Pearson similarity, and took their highest-rated movies. Because these users are similar, many movies they rated 4-5 stars will be predicted 4-5 stars anyway. Instead, we take the movies among these that are rarely rated and recommend those. Some will not appeal to a given user, but we think that novel, “hipster” movies could add business value.

Our algorithm was slightly complicated to implement in R, but the functions below work by:

  1. Taking the Pearson similarity across all users.
  2. Finding the top 20 most similar users.
  3. Calculating the mean rating for all movies and filtering to only 4-5 star reviews.
  4. Inner-joining these onto a data frame with the rating counts for all movies and sorting in reverse.
  5. Selecting the top 10 highly rated (within the group of 20) but rarely rated movies and recommending them.

Similar Users

This will be the basis for our serendipitous recommender.

similarity_users <- similarity(MovieLense, 
                               method = "pearson", 
                               which = "users")

Extract SVD Predictions

The predictions from the SVD algorithm, extracted as full rating matrices for both the train and known sets, are given below.

pr1 <- predict(rm_svd, getData(mM, "train"), type = "ratingMatrix")
pr2 <- predict(rm_svd, getData(mM, "known"), type = "ratingMatrix")
temp1 <- as(pr1, 'data.frame')
temp2 <-  as(pr2, 'data.frame')

Combining these into svdPred:

svdPred <- rbind(temp1, temp2)
dim(svdPred)
## [1] 1569152       3
svdPred %>% head()
##      user                                                 item     rating
## 1     723                                     Toy Story (1995) -0.2727273
## 661   723                                     GoldenEye (1995)  3.2949177
## 1321  723                                    Four Rooms (1995)  3.2541443
## 1981  723                                    Get Shorty (1995)  3.2822212
## 2641  723                                       Copycat (1995)  3.2531762
## 3301  723 Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)  3.2657702
svdPred <- svdPred %>% spread(key = item, value = rating)
dim(svdPred)
## [1]  943 1665

We also need the original ratings in wide format so that we can more easily insert values later.

narrowDF <- as(MovieLense, 'data.frame')
narrowDF %>% head()
##     user                                                 item rating
## 1      1                                     Toy Story (1995)      5
## 453    1                                     GoldenEye (1995)      3
## 584    1                                    Four Rooms (1995)      4
## 674    1                                    Get Shorty (1995)      3
## 883    1                                       Copycat (1995)      3
## 969    1 Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)      5
wideDF <- spread(data = narrowDF, key = item, value = rating)
wideDF[1:10, 1:10]
##    user 'Til There Was You (1997) 1-900 (1994) 101 Dalmatians (1996)
## 1     1                        NA           NA                     2
## 2    10                        NA           NA                    NA
## 3   100                        NA           NA                    NA
## 4   101                        NA           NA                     3
## 5   102                        NA           NA                    NA
## 6   103                        NA           NA                    NA
## 7   104                        NA           NA                    NA
## 8   105                        NA           NA                    NA
## 9   106                        NA           NA                    NA
## 10  107                        NA           NA                    NA
##    12 Angry Men (1957) 187 (1997) 2 Days in the Valley (1996)
## 1                    5         NA                          NA
## 2                    5         NA                          NA
## 3                   NA         NA                          NA
## 4                   NA         NA                          NA
## 5                   NA         NA                          NA
## 6                   NA         NA                          NA
## 7                   NA          1                           3
## 8                   NA         NA                          NA
## 9                   NA         NA                          NA
## 10                  NA         NA                          NA
##    20,000 Leagues Under the Sea (1954) 2001: A Space Odyssey (1968)
## 1                                    3                            4
## 2                                   NA                            5
## 3                                   NA                           NA
## 4                                   NA                           NA
## 5                                   NA                           NA
## 6                                   NA                           NA
## 7                                   NA                           NA
## 8                                   NA                           NA
## 9                                   NA                           NA
## 10                                  NA                           NA
##    3 Ninjas: High Noon At Mega Mountain (1998)
## 1                                           NA
## 2                                           NA
## 3                                           NA
## 4                                           NA
## 5                                           NA
## 6                                           NA
## 7                                           NA
## 8                                           NA
## 9                                           NA
## 10                                          NA

We need the user IDs and movie names for these ratings so that we can index into our predictions.

userID <- wideDF$user
movieNames <- names(wideDF)[2:dim(wideDF)[2]]
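
As a quick sanity check, the extracted dimensions should match the 943 x 1664 rating matrix:

length(userID)     # 943
length(movieNames) # 1664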

Here is an example of the similarity matrix. There are some NA values, but we can safely ignore those since we have enough similar users.

pearsonMat <- as.matrix(similarity_users) %>% 
  as_data_frame()

names(pearsonMat) = userID
pearsonMat$user <- userID
pearsonMat %>% head()
## # A tibble: 6 x 944
##     `1`  `10`  `100` `101`  `102` `103` `104` `105`  `106` `107` `108`
##   <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1 0     0.544  0.514 0.667  0.633 0.587 0.577 0.765  0.5   0.5   0.583
## 2 0.544 0      0.5   0.532  0.598 0.644 0.737 0.707  0.569 0.751 0.532
## 3 0.514 0.5    0     0.5   NA     0.5   0.519 0.585 NA     0.592 0.5  
## 4 0.667 0.532  0.5   0      1     0.5   0.5   0.737 NA     0.5   0.5  
## 5 0.633 0.598 NA     1      0     0.569 0.548 0.684  0.703 0.523 0.548
## 6 0.587 0.644  0.5   0.5    0.569 0     0.565 0.762  0.535 0.579 0.5  
## # ... with 933 more variables: `109` <dbl>, `11` <dbl>, `110` <dbl>, ...

And now we need to make it narrow:

pearsonMatNarrow <- pearsonMat %>% 
  melt('user')
pearsonMatNarrow %>% head()
##   user variable     value
## 1    1        1 0.0000000
## 2   10        1 0.5437268
## 3  100        1 0.5137099
## 4  101        1 0.6666667
## 5  102        1 0.6332355
## 6  103        1 0.5866514

simUser <- function(uid, df){
  top20 <- df %>% 
    filter(user == uid & value != 1) %>%  # drop perfectly correlated pairs
    arrange(desc(value)) %>% 
    na.omit() %>% 
    head(20)
  return(top20)
}

simUser(1, pearsonMatNarrow)
##    user variable     value
## 1     1      685 0.9459029
## 2     1      195 0.9279968
## 3     1      567 0.9246919
## 4     1      193 0.9153194
## 5     1      129 0.9017475
## 6     1      765 0.8852599
## 7     1      459 0.8818540
## 8     1      535 0.8818540
## 9     1      885 0.8818540
## 10    1      558 0.8791842
## 11    1      821 0.8693042
## 12    1      810 0.8661675
## 13    1      613 0.8647987
## 14    1      720 0.8611349
## 15    1      557 0.8564751
## 16    1      199 0.8449490
## 17    1      283 0.8449490
## 18    1      352 0.8449490
## 19    1      927 0.8449490
## 20    1      153 0.8432354

This filters to the relevant user and orders the candidates by similarity. Next, mean20() computes the mean rating that these similar users gave each movie and keeps only the highly rated ones:

mean20 <- function(userList, df, minRating = 4){
  top20 <- df[userList, ] %>% select(-user)
  top20Mat <- as.matrix(as.data.frame(lapply(top20, as.numeric))) # coerce every column to numeric
  
  itemMeans <- colMeans(top20Mat, na.rm = TRUE)
  meanDF <- data_frame(names(top20), itemMeans)
  colnames(meanDF) <- c('itemId', 'rating')
  meanDF <- meanDF %>% 
    filter(rating > minRating) %>% # keep only highly rated movies
    arrange(desc(rating))          # best-rated first
  return(meanDF)
}

test <- simUser(9, pearsonMatNarrow)
meanDF <- mean20(userList = test$variable, wideDF)
meanDF
## # A tibble: 205 x 2
##    itemId                                         rating
##    <chr>                                           <dbl>
##  1 Affair to Remember, An (1957)                       5
##  2 Apartment, The (1960)                               5
##  3 Big Sleep, The (1946)                               5
##  4 Blues Brothers 2000 (1998)                          5
##  5 Bottle Rocket (1996)                                5
##  6 Bread and Chocolate (Pane e cioccolata) (1973)      5
##  7 Celluloid Closet, The (1995)                        5
##  8 Close Shave, A (1995)                               5
##  9 Cool Hand Luke (1967)                               5
## 10 Deconstructing Harry (1997)                         5
## # ... with 195 more rows

Next, we count how sparsely each movie is rated:

numRatings <- function(mat){
  # Counts NA values per column; more NAs means fewer actual ratings for that movie.
  temp <- sapply(mat, function(x) sum(is.na(x)))
  temp <- data_frame(names(temp), temp)
  names(temp) <- c('itemId', 'count')
  return(temp)
}
numRatingsList <- numRatings(select(wideDF, -user))

innerJoiner <- function(rankList, meanDF){
  rankList <- inner_join(rankList, meanDF, by = 'itemId')
  rankList <- rankList %>% 
    arrange(desc(count)) %>%  # most NAs first, i.e. the least-rated movies
    head(10)
  return(rankList)
}
innerJoiner(numRatingsList, meanDF)
## # A tibble: 10 x 3
##    itemId                                         count rating
##    <chr>                                          <int>  <dbl>
##  1 Everest (1998)                                   941      5
##  2 Prefontaine (1997)                               940      5
##  3 Wedding Gift, The (1994)                         940      5
##  4 Margaret's Museum (1995)                         937      5
##  5 Year of the Horse (1997)                         936      5
##  6 Grace of My Heart (1996)                         935      5
##  7 Walking and Talking (1996)                       935      5
##  8 Stalker (1979)                                   932      5
##  9 Bread and Chocolate (Pane e cioccolata) (1973)   931      5
## 10 Primary Colors (1998)                            930      5

serRecs <- function(rankList, mat, uId){
  itemsToChange <- rankList$itemId
  # NB: uId indexes a row position in mat; it is not matched against the user column.
  mat[uId, itemsToChange] <- 5
  return(mat)
}

aggrigator <- function(uID, matToChange, wideDF, pearsonMatNarrow, rankList){
  simUserIDS <- simUser(uID, pearsonMatNarrow)
  avg20 <- mean20(userList = simUserIDS$variable, wideDF)
  topN <- innerJoiner(rankList, avg20)
  ret <- serRecs(topN, matToChange, uID)
  return(ret)
}

testMAt <- aggrigator(uID = 559, matToChange = svdPred, 
                      wideDF = wideDF,  
                      pearsonMatNarrow = pearsonMatNarrow, 
                      rankList = numRatingsList)

testMAt <- testMAt %>% melt() %>% as.data.table()
## Using user as id variables
testMAt[value==5]
##     user                                    variable value
##  1:  824                            Aparajito (1956)     5
##  2:  824               Great Day in Harlem, A (1994)     5
##  3:  824                               Harlem (1993)     5
##  4:  824                     Hearts and Minds (1996)     5
##  5:  824                     Murder, My Sweet (1944)     5
##  6:  824                      Pather Panchali (1955)     5
##  7:  824                          Prefontaine (1997)     5
##  8:  824 Two or Three Things I Know About Her (1966)     5
##  9:  824                Whole Wide World, The (1996)     5
## 10:  824      World of Apu, The (Apur Sansar) (1959)     5
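
Note that uID = 559 is used as a row position in svdPred rather than matched against the user column; because users sort as character strings, row 559 appears to belong to user 824, which is why user 824 shows up above. Now we apply the same adjustment across all users: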
svdPred1 <- svdPred
for (i in wideDF$user) {
  svdPred1 <- aggrigator(uID = i, 
             matToChange = svdPred1,  
             wideDF = wideDF,  
             pearsonMatNarrow = pearsonMatNarrow, 
             rankList = numRatingsList)
}

Deliverable 3

Compare and report on any change in accuracy before and after you’ve made the change in #2.

predDF <- svdPred1 %>% melt() %>% 
  mutate(user = as.character(user))
## Using user as id variables
tempActual <- temp2 %>% mutate(user = as.character(user))

Get the real values of the test data:

actualTest <- as(getData(mM, "unknown") , "data.frame") 
actualTest <- actualTest %>% as.data.table()
prdDF <- as.data.table(predDF) 
# tempActual <- as.data.table(tempActual)
# names(tempActual)
names(prdDF)[2] <- 'item'
dtTest <- merge(prdDF, actualTest, by = c("user", "item"))
head(dtTest)
##    user                           item    value rating
## 1:  103      Fifth Element, The (1997) 3.787797      4
## 2:  103          Godfather, The (1972) 3.857578      4
## 3:  103  Independence Day (ID4) (1996) 3.558282      3
## 4:  103                    Jaws (1975) 3.770834      3
## 5:  103 Speed 2: Cruise Control (1997) 3.744836      1
## 6:  107    English Patient, The (1996) 3.120613      2
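
Deliverable 3 asks for an accuracy comparison before and after the change; a minimal sketch of the post-change RMSE over this merged table (value is our adjusted prediction, rating is the actual):

# RMSE of adjusted predictions against the held-out ratings
dtTest[, sqrt(mean((value - rating)^2))]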

Here we have what we would recommend to user 1:

prdDF[value == 5 & user == 1] 
##     user                              item value
##  1:    1      Anne Frank Remembered (1995)     5
##  2:    1            Band Wagon, The (1953)     5
##  3:    1               Black Beauty (1994)     5
##  4:    1   Cry, the Beloved Country (1995)     5
##  5:    1                   Faithful (1996)     5
##  6:    1     Four Days in September (1997)     5
##  7:    1 Kama Sutra: A Tale of Love (1996)     5
##  8:    1            Oscar & Lucinda (1997)     5
##  9:    1         To Be or Not to Be (1942)     5
## 10:    1        Welcome To Sarajevo (1997)     5

And here we have the imputed values (on the test set) vs. the actual ratings:

dtTest[value==5]
##    user                        item value rating
## 1:  179 Horse Whisperer, The (1998)     5      3
## 2:  254         Tom and Huck (1995)     5      3

This set is so small because of the small test set and the few users who rated these movies. It is a prime example of why serendipity doesn’t validate well offline in this case; it needs real-world, real-time data instead.

Because the test matrix is so long, the accuracy measures for our test set are virtually identical before and after the change. As a rough illustration, a single rating off by one point, averaged over three ratings, contributes an RMSE of:

error = 1^2
sqrt(error/3)
## [1] 0.5773503

which is small, and when spread across the long matrix it is inconsequential.

Deliverable 4

As part of your textual conclusion, discuss one or more additional experiments that could be performed and/or metrics that could be evaluated only if online evaluation was possible. Also, briefly propose how you would design a reasonable online evaluation environment.

Serendipitous recommendations will rarely increase accuracy; instead, they offer a more interesting recommendation, so we don’t expect our model to score better offline. In fact, the way to validate models like this is to measure their effect on the business’s bottom line. For example, Netflix could use this algorithm to give serendipitous recommendations to x% of their users. If those users were less likely to cancel their subscription, if they actually finished watching the movie, if they made the purchase, etc., then it would be a good recommendation, even if it worsened the RMSE during the model build.
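
As a hypothetical sketch of such an online experiment (the 10% split and the seed are illustrative assumptions, not part of our pipeline):

# Randomly assign 10% of users to receive serendipitous recommendations
set.seed(42)                                   # illustrative seed
all_users <- wideDF$user
treatment <- sample(all_users, round(0.10 * length(all_users)))
# Serve serRecs-adjusted lists to `treatment` and standard SVD lists to the
# rest, then compare KPIs (cancellation rate, watch-through rate) between groups.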

This is especially true because we are recommending movies that are not commonly seen, which means our serendipitous recommendations are unlikely to have been rated by the user, making cross-validation difficult. In the real world, if we recommend a movie, a user is much more likely to view and rate it, enabling real-time validation.