Both authors, Jaan and Kai, have worked with MovieLens data before, though not as a team, and Kai used Python while Jaan used R. As Kai was transitioning from Python, the built-in MovieLens data provided an excellent way to get acquainted with the tools for the final project.
The main reason we chose this dataset is that the one we plan to use for our final project may be too large for recommenderlab; we may seek additional libraries to leverage distributed computing.
Compare accuracy of two recommender systems.
library(tidyverse)
library(recommenderlab)
library(reshape2)
library(data.table) # faster tabular operations
We use the MovieLense data that ships with recommenderlab.
data(MovieLense)
image(MovieLense, main = "Heatmap of the rating matrix")
Here we can see that the built-in MovieLense rating matrix is very sparse, with many more movies than users.
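To put a number on that sparsity, a quick check using recommenderlab accessors:
dim(MovieLense)                              # users x movies
nratings(MovieLense) / prod(dim(MovieLense)) # roughly 6% of cells hold a rating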
First we split the data using the evaluationScheme() function.
# Use eval scheme to split the data 70/30 Train/Test
eval_create <- proc.time()
mM <- evaluationScheme(MovieLense,
method = "split",
train = 0.7,
given = -5, # all but 5 ratings per user are given (5 held out)
goodRating = 5)
eval_create <- proc.time() - eval_create
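The timing object is captured but never displayed; its elapsed wall-clock seconds can be read off directly:
eval_create["elapsed"] # seconds spent building the evaluation scheme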
As shown in the figure below, the similarity of users is denoted by the darkness of the cell color (white to yellow to orange to red). We see that users share varying amounts of similarity depending on the pairing, according to the Pearson method.
# Review the similarity between the users
s_users <- similarity(mM@data[1:20,],
method = "pearson",
which = "users")
image(as.matrix(s_users), main = "User Similarity, First 20 Users")
Taking another look at the data, we review four histograms (see below). Ratings per user are heavily right-skewed, most movie averages fall between roughly 3 and 4 stars, the raw ratings are discrete with 4 the most common value, and normalization centers the distribution near zero.
par(mfrow=c(2,2))
hist(rowCounts(mM@data), breaks=50, main = "Ratings per User")
hist(colMeans(mM@data), breaks=50, main = "Mean Rating per Movie")
hist(getRatings(mM@data), breaks=100, main = "Hist of Ratings, Original")
hist(getRatings(normalize(mM@data)), breaks=100,
     main = "Hist of Ratings, Normalized")
Below, we create four recommenders: IBCF, UBCF, SVD, and POPULAR.
# Build Recommenders to Compare
rm_ibcf <- Recommender(data = getData(mM, "train"),
method = "IBCF")
rm_ubcf <- Recommender(data = getData(mM, "train"),
method = "UBCF")
rm_svd <- Recommender(data = getData(mM, "train"),
method = "SVD")
rm_popu <- Recommender(data = getData(mM, "train"),
method = "POPULAR")
In the recommenderlab package, predict() can return either top-N lists or predicted ratings for all user-item combinations; we request the latter with type = "ratings".
p1 <- predict(rm_ibcf, getData(mM, "known"), type = "ratings")
p2 <- predict(rm_ubcf, getData(mM, "known"), type = "ratings")
p3 <- predict(rm_svd, getData(mM, "known"), type = "ratings")
p4 <- predict(rm_popu, getData(mM, "known"), type = "ratings")
err_eval <- rbind(
IBCF = calcPredictionAccuracy(p1, getData(mM, "unknown")),
UBCF = calcPredictionAccuracy(p2, getData(mM, "unknown")),
SVD = calcPredictionAccuracy(p3, getData(mM, "unknown")),
POPU = calcPredictionAccuracy(p4, getData(mM, "unknown")))
err_eval
## RMSE MSE MAE
## IBCF 1.4538081 2.1135579 1.1236407
## UBCF 1.0616733 1.1271503 0.8564724
## SVD 1.0398814 1.0813533 0.8348754
## POPU 0.9989997 0.9980005 0.7913465
These four methods are fairly close in their error values, with POPULAR performing best. Further, an RMSE of \(\approx 1.04\) for SVD isn't bad considering that we didn't normalize the ratings before building the models.
For baseline evaluation with repeated sampling, we re-run the 70/30 split five times (k = 5) and evaluate SVD, UBCF, and POPULAR.
recs_qtys <- c(1,3,5,7,9,11,13,15) # top-N list sizes to evaluate
# Use eval scheme with five repeated 70/30 splits
cM <- evaluationScheme(MovieLense,
                       method = "split",
                       train = .7,
                       k = 5, # number of repeated splits
                       given = -5, # all but 5 ratings per user are given
                       goodRating = 5)
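The chunk that actually runs the evaluation appears to have been dropped in rendering; below is a minimal sketch that would produce the rr object plotted later and the fold output that follows (the algorithm list and its parameters are our assumptions):
algorithms <- list(UBCF    = list(name = "UBCF",    param = NULL),
                   POPULAR = list(name = "POPULAR", param = NULL),
                   SVD     = list(name = "SVD",     param = NULL))
rr <- evaluate(cM, method = algorithms, type = "topNList", n = recs_qtys)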
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/3.96sec]
## 2 [0.01sec/3.89sec]
## 3 [0sec/3.87sec]
## 4 [0.02sec/3.66sec]
## 5 [0.02sec/3.57sec]
## POPULAR run fold/sample [model time/prediction time]
## 1 [0.01sec/1.64sec]
## 2 [0.01sec/1.69sec]
## 3 [0sec/1.7sec]
## 4 [0sec/1.63sec]
## 5 [0.02sec/1.56sec]
## SVD run fold/sample [model time/prediction time]
## 1 [0.73sec/0.43sec]
## 2 [0.73sec/0.68sec]
## 3 [0.72sec/0.41sec]
## 4 [0.82sec/0.42sec]
## 5 [0.73sec/0.42sec]
## List of evaluation results for 3 recommenders:
## Evaluation results for 5 folds/samples using method 'UBCF'.
## Evaluation results for 5 folds/samples using method 'POPULAR'.
## Evaluation results for 5 folds/samples using method 'SVD'.
Below, we plot the ROC curves and the precision/recall curves for the three recommenders we evaluated. Units along the TPR (true-positive rate) and FPR (false-positive rate) axes are very small, and all three produce similar performance for top-5 recommendations.
par(mfrow=c(1,2))
plot(rr, legend = "topleft")
plot(rr, "prec/rec", legend="topright")
Implement support for at least one business or user experience goal such as increased serendipity, novelty, or diversity
As previous students have mentioned in our presentations, there can be more to a business model than straight accuracy.
It can be delightful to find a movie on Netflix that is a bit outside your usual taste but still surprises and excites. Generally, this involves taking some risk on the predicted rating.
We believe that users are naturally curious and do not always follow recommendations. They instead go to a general list of movies and pick one, not at random, but still one that the recommender might think would not appeal to them. This introduces natural experiments where a user decides if they like a film.
Now, there are many methods to achieve this, and any answer for which one is better would have to be validated with A/B testing on various KPIs relevant to the business. (More on this later.)
We believe that surfacing movies that are less common than the usual recommendations is a relevant business idea. To achieve this, we take a user's 20 most similar users, based on Pearson similarity, and collect their highest-rated movies. Because these users are similar, many movies they rated 4-5 stars will be predicted 4-5 stars for the target user as well. Instead of those, we take the titles among them that are rarely rated and recommend those. Some will not appeal to a given user, but we think such novel, “hipster” movies could add business value.
Our algorithm was slightly complicated to implement in R, but the functions below work by:
1. finding a user's 20 most similar users by Pearson similarity;
2. averaging those users' ratings per movie and keeping titles above a minimum rating;
3. keeping the 10 of those titles that are most rarely rated overall; and
4. overwriting the SVD predictions for those titles with a 5 so they surface in the user's recommendations.
This will be the basis for our serendipitous recommender.
similarity_users <- similarity(MovieLense,
method = "pearson",
which = "users")
Next, we generate SVD predictions for all users, covering both the training users and the known portion of the test users.
pr1 <- predict(rm_svd, getData(mM, "train"), type = "ratingMatrix")
pr2 <- predict(rm_svd, getData(mM, "known"), type = "ratingMatrix")
temp1 <- as(pr1, 'data.frame')
temp2 <- as(pr2, 'data.frame')
We stack the two prediction sets into a single data frame, svdPred:
svdPred <- rbind(temp1, temp2)
dim(svdPred)
## [1] 1569152 3
svdPred %>% head()
## user item rating
## 1 723 Toy Story (1995) -0.2727273
## 661 723 GoldenEye (1995) 3.2949177
## 1321 723 Four Rooms (1995) 3.2541443
## 1981 723 Get Shorty (1995) 3.2822212
## 2641 723 Copycat (1995) 3.2531762
## 3301 723 Shanghai Triad (Yao a yao yao dao waipo qiao) (1995) 3.2657702
svdPred <- svdPred %>% spread(key = item, value = rating)
dim(svdPred)
## [1] 943 1665
We also reshape the original ratings into wide format so that we can more easily index into them and write values back in.
narrowDF <- as(MovieLense, 'data.frame')
narrowDF %>% head()
## user item rating
## 1 1 Toy Story (1995) 5
## 453 1 GoldenEye (1995) 3
## 584 1 Four Rooms (1995) 4
## 674 1 Get Shorty (1995) 3
## 883 1 Copycat (1995) 3
## 969 1 Shanghai Triad (Yao a yao yao dao waipo qiao) (1995) 5
wideDF <- spread(data = narrowDF, key = item, value = rating)
wideDF[1:10, 1:10]
## user 'Til There Was You (1997) 1-900 (1994) 101 Dalmatians (1996)
## 1 1 NA NA 2
## 2 10 NA NA NA
## 3 100 NA NA NA
## 4 101 NA NA 3
## 5 102 NA NA NA
## 6 103 NA NA NA
## 7 104 NA NA NA
## 8 105 NA NA NA
## 9 106 NA NA NA
## 10 107 NA NA NA
## 12 Angry Men (1957) 187 (1997) 2 Days in the Valley (1996)
## 1 5 NA NA
## 2 5 NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## 7 NA 1 3
## 8 NA NA NA
## 9 NA NA NA
## 10 NA NA NA
## 20,000 Leagues Under the Sea (1954) 2001: A Space Odyssey (1968)
## 1 3 4
## 2 NA 5
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## 7 NA NA
## 8 NA NA
## 9 NA NA
## 10 NA NA
## 3 Ninjas: High Noon At Mega Mountain (1998)
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
## 7 NA
## 8 NA
## 9 NA
## 10 NA
We need the user IDs and movie names from this matrix so that we can index into our predictions.
userID <- wideDF$user
movieNames <- names(wideDF)[2:dim(wideDF)[2]]
Here is an example of the similarity matrix. There are some NA values, but we can ignore those since each user has enough similar users.
# Convert the dist object to a matrix/tibble; columns are labeled with the
# (character-sorted) user IDs from wideDF
pearsonMat <- as.matrix(similarity_users) %>%
  as_data_frame()
names(pearsonMat) <- userID
pearsonMat$user <- userID
pearsonMat %>% head()
## # A tibble: 6 x 944
## `1` `10` `100` `101` `102` `103` `104` `105` `106` `107` `108`
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0.544 0.514 0.667 0.633 0.587 0.577 0.765 0.5 0.5 0.583
## 2 0.544 0 0.5 0.532 0.598 0.644 0.737 0.707 0.569 0.751 0.532
## 3 0.514 0.5 0 0.5 NA 0.5 0.519 0.585 NA 0.592 0.5
## 4 0.667 0.532 0.5 0 1 0.5 0.5 0.737 NA 0.5 0.5
## 5 0.633 0.598 NA 1 0 0.569 0.548 0.684 0.703 0.523 0.548
## 6 0.587 0.644 0.5 0.5 0.569 0 0.565 0.762 0.535 0.579 0.5
## # ... with 933 more variables
Now we melt it back into narrow (long) format:
pearsonMatNarrow <- pearsonMat %>%
melt('user' )
pearsonMatNarrow %>% head()
## user variable value
## 1 1 1 0.0000000
## 2 10 1 0.5437268
## 3 100 1 0.5137099
## 4 101 1 0.6666667
## 5 102 1 0.6332355
## 6 103 1 0.5866514
simUser <- function(uid, df){
  # Return the 20 users most similar to `uid`, dropping perfect
  # correlations (value == 1), which tend to come from tiny overlaps
  top20 <- df %>%
    filter(user == uid & value != 1) %>%
    arrange(desc(value)) %>%
    na.omit() %>%
    head(20)
  return(top20)
}
simUser(1, pearsonMatNarrow)
## user variable value
## 1 1 685 0.9459029
## 2 1 195 0.9279968
## 3 1 567 0.9246919
## 4 1 193 0.9153194
## 5 1 129 0.9017475
## 6 1 765 0.8852599
## 7 1 459 0.8818540
## 8 1 535 0.8818540
## 9 1 885 0.8818540
## 10 1 558 0.8791842
## 11 1 821 0.8693042
## 12 1 810 0.8661675
## 13 1 613 0.8647987
## 14 1 720 0.8611349
## 15 1 557 0.8564751
## 16 1 199 0.8449490
## 17 1 283 0.8449490
## 18 1 352 0.8449490
## 19 1 927 0.8449490
## 20 1 153 0.8432354
This filters to the given user and orders the other users by similarity, keeping the top 20.
mean20 <- function(userList, df, minRating = 4){
  # Note: `userList` is used as row positions into `df`, not matched
  # against the user column
  top20 <- df[userList, ] %>% select(-user)
  # Coerce every column to numeric (ratings can arrive as factors)
  top20Mat <- as.matrix(as.data.frame(lapply(top20, as.numeric)))
  itemMeans <- colMeans(top20Mat, na.rm = TRUE)
  meanDF <- data_frame(names(top20), itemMeans)
  colnames(meanDF) <- c('itemId', 'rating')
  meanDF <- meanDF %>%
    filter(rating > minRating) %>% # keep only highly rated titles
    arrange(desc(rating))          # best rated first
  return(meanDF)
}
test <- simUser(9, pearsonMatNarrow)
meanDF <- mean20(userList = test$variable, wideDF)
meanDF
## # A tibble: 205 x 2
## itemId rating
## <chr> <dbl>
## 1 Affair to Remember, An (1957) 5
## 2 Apartment, The (1960) 5
## 3 Big Sleep, The (1946) 5
## 4 Blues Brothers 2000 (1998) 5
## 5 Bottle Rocket (1996) 5
## 6 Bread and Chocolate (Pane e cioccolata) (1973) 5
## 7 Celluloid Closet, The (1995) 5
## 8 Close Shave, A (1995) 5
## 9 Cool Hand Luke (1967) 5
## 10 Deconstructing Harry (1997) 5
## # ... with 195 more rows
numRatings <- function(mat){
  # Counts the NA values per item: more NAs means fewer ratings, so a
  # high count marks a rarely rated movie
  temp <- sapply(mat, function(x) sum(is.na(x)))
  temp <- data_frame(names(temp), temp)
  names(temp) <- c('itemId', 'count')
  return(temp)
}
numRatingsList <- numRatings(select(wideDF, -user))
innerJoiner <- function(rankList, meanDF){
  # Keep only titles the similar users rated highly, then take the 10
  # with the most NAs, i.e., the most rarely rated overall
  rankList <- inner_join(rankList, meanDF, by = 'itemId')
  rankList <- rankList %>%
    arrange(desc(count)) %>%
    head(10)
  return(rankList)
}
innerJoiner(numRatingsList, meanDF)
## # A tibble: 10 x 3
## itemId count rating
## <chr> <int> <dbl>
## 1 Everest (1998) 941 5
## 2 Prefontaine (1997) 940 5
## 3 Wedding Gift, The (1994) 940 5
## 4 Margaret's Museum (1995) 937 5
## 5 Year of the Horse (1997) 936 5
## 6 Grace of My Heart (1996) 935 5
## 7 Walking and Talking (1996) 935 5
## 8 Stalker (1979) 932 5
## 9 Bread and Chocolate (Pane e cioccolata) (1973) 931 5
## 10 Primary Colors (1998) 930 5
serRecs <- function(rankList, mat, uId){
  # Overwrite the predicted ratings for the chosen titles with a 5 so
  # they surface in the user's top recommendations. Note that `uId`
  # indexes the row by position rather than by matching a user column
  itemsToChange <- rankList$itemId
  mat[uId, itemsToChange] <- 5
  return(mat)
}
aggrigator <- function(uID, matToChange, wideDF, pearsonMatNarrow, rankList){
  # End-to-end serendipity step for one user: similar users -> their
  # well-rated titles -> the 10 rarest of those -> inject as 5s
  simUserIDS <- simUser(uID, pearsonMatNarrow)
  avg20 <- mean20(userList = simUserIDS$variable, wideDF)
  topN <- innerJoiner(rankList, avg20)
  ret <- serRecs(topN, matToChange, uID)
  return(ret)
}
testMAt <- aggrigator(uID = 559, matToChange = svdPred,
wideDF = wideDF,
pearsonMatNarrow = pearsonMatNarrow,
rankList = numRatingsList)
testMAt <- testMAt %>% melt() %>% as.data.table()
## Using user as id variables
testMAt[value==5]
## user variable value
## 1: 824 Aparajito (1956) 5
## 2: 824 Great Day in Harlem, A (1994) 5
## 3: 824 Harlem (1993) 5
## 4: 824 Hearts and Minds (1996) 5
## 5: 824 Murder, My Sweet (1944) 5
## 6: 824 Pather Panchali (1955) 5
## 7: 824 Prefontaine (1997) 5
## 8: 824 Two or Three Things I Know About Her (1966) 5
## 9: 824 Whole Wide World, The (1996) 5
## 10: 824 World of Apu, The (Apur Sansar) (1959) 5
svdPred1 <- svdPred
# Apply the serendipity adjustment to every user's predictions
for (i in wideDF$user) {
  svdPred1 <- aggrigator(uID = i,
                         matToChange = svdPred1,
                         wideDF = wideDF,
                         pearsonMatNarrow = pearsonMatNarrow,
                         rankList = numRatingsList)
}
Compare and report on any change in accuracy before and after you’ve made the change in #2.
predDF = svdPred1 %>% melt() %>%
mutate(user = as.character(user))
## Using user as id variables
tempActual <- temp2 %>% mutate(user = as.character(user))
Get the real values of the test data:
actualTest <- as(getData(mM, "unknown") , "data.frame")
actualTest <- actualTest %>% as.data.table()
prdDF <- as.data.table(predDF)
names(prdDF)[2] <- 'item'
dtTest <- merge(prdDF, actualTest, by = c("user", "item"))
head(dtTest)
## user item value rating
## 1: 103 Fifth Element, The (1997) 3.787797 4
## 2: 103 Godfather, The (1972) 3.857578 4
## 3: 103 Independence Day (ID4) (1996) 3.558282 3
## 4: 103 Jaws (1975) 3.770834 3
## 5: 103 Speed 2: Cruise Control (1997) 3.744836 1
## 6: 107 English Patient, The (1996) 3.120613 2
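Before looking at individual users, we can compute the overall error of the adjusted predictions directly from this merged table; a quick sketch using the value (adjusted prediction) and rating (held-out actual) columns:
# Overall accuracy of the serendipity-adjusted predictions
dtTest[, .(RMSE = sqrt(mean((value - rating)^2)),
           MAE  = mean(abs(value - rating)))]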
Here we have what we would recommend to user 1:
prdDF[value == 5 & user == 1]
## user item value
## 1: 1 Anne Frank Remembered (1995) 5
## 2: 1 Band Wagon, The (1953) 5
## 3: 1 Black Beauty (1994) 5
## 4: 1 Cry, the Beloved Country (1995) 5
## 5: 1 Faithful (1996) 5
## 6: 1 Four Days in September (1997) 5
## 7: 1 Kama Sutra: A Tale of Love (1996) 5
## 8: 1 Oscar & Lucinda (1997) 5
## 9: 1 To Be or Not to Be (1942) 5
## 10: 1 Welcome To Sarajevo (1997) 5
And here are the imputed values (on the test set) versus the actual ratings:
dtTest[value==5]
## user item value rating
## 1: 179 Horse Whisperer, The (1998) 5 3
## 2: 254 Tom and Huck (1995) 5 3
The overlap is so small because of the small test set and the few users who rated these rare movies. This is a prime example of why serendipity doesn't work well offline in this case; it instead needs real-world, real-time data.
The accuracy measures for our test set are virtually identical before and after the change, given how few of the injected ratings land in the long test matrix; a back-of-the-envelope RMSE contribution:
error = 1^2
sqrt(error/3)
## [1] 0.5773503
This is small, and when spread over the long test matrix, inconsequential.
As part of your textual conclusion, discuss one or more additional experiments that could be performed and/or metrics that could be evaluated only if online evaluation was possible. Also, briefly propose how you would design a reasonable online evaluation environment.
Serendipitous recommendations will rarely increase accuracy. Instead, they offer a more interesting recommendation, so we don't expect our model to score better offline. In fact, the way to validate models like this is to see their effect on the business's bottom line. For example, Netflix could use this algorithm to give serendipitous recommendations to x % of their users. If those users were less likely to cancel their subscription, if they actually finished watching the movie, if they made the purchase, etc., then it would be a good recommendation, even if it worsened the RMSE during the model build.
This is especially true because we are recommending movies that are not commonly seen, which means our serendipitous recommendations are unlikely to already be rated by the user, making offline cross-validation difficult. In the real world, if we recommend a movie, a user is much more likely to view and rate it, enabling real-time validation.
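As a concrete starting point for such an online environment, here is a minimal sketch of the user bucketing; the 10% treatment share and the seed are illustrative assumptions, not values from our analysis:
# Hypothetical A/B assignment: a fixed share of users receives the
# serendipitous recommender, the rest keep the baseline SVD model
set.seed(612)
treatment_share <- 0.10
users <- unique(wideDF$user)
ab_groups <- data.frame(
  user = users,
  arm  = ifelse(runif(length(users)) < treatment_share,
                "serendipity", "baseline"))
table(ab_groups$arm)
# KPIs per arm (watch completion, purchases, churn) would then be
# compared over the experiment window.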