The goal of this assignment is for you to try out different ways of implementing and configuring a recommender, and to evaluate your different approaches.

Content-based and Collaborative Filtering:

recommenderlab offers several built-in datasets, MovieLens among them. This gave me a good opportunity to follow up on the movie topic, and since MovieLens constantly updates its datasets, I went directly to the source and downloaded the smallest of the current datasets; you can check the files out in my GitHub repository.

Data Exploration:

After reading in the data, I review the ratings variable, which contains the core data used for this assignment.

# Libraries used throughout
library(recommenderlab)
library(tidyverse)

# Data import from Github
ratings <- read.csv("https://raw.githubusercontent.com/sortega7878/DATA612/master/p2/ratings.csv")
titles  <- read.csv("https://raw.githubusercontent.com/sortega7878/DATA612/master/p2/movies.csv")
                          

# Convert long-format ratings to a wide user-by-movie matrix,
# then coerce to recommenderlab's sparse rating matrix class
mMatrix <- ratings %>% 
  select(-timestamp) %>% 
  spread(movieId, rating)

row.names(mMatrix) <- mMatrix[,1]   # first column holds the userId
mMatrix <- mMatrix[-c(1)]           # drop the userId column
mMatrix <- as(as.matrix(mMatrix), "realRatingMatrix")

mMatrix
## 610 x 9724 rating matrix of class 'realRatingMatrix' with 100836 ratings.
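
As a side note, recommenderlab can also build the sparse rating matrix directly from the long-format data frame, skipping the intermediate wide matrix. A minimal sketch (mMatrix2 is just an illustrative name; the coercion expects the columns ordered as user, item, rating):

# Direct coercion from a user/item/rating data frame
mMatrix2 <- as(ratings[, c("userId", "movieId", "rating")], "realRatingMatrix")
mMatrix2  # same 610 x 9724 matrix with 100836 ratings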

Exploration and Preparation

The book recommends exploring the available data first, to identify features that will help or hurt the performance of the model.

Let us take a quick look at the distribution of all ratings to make sure there are no surprises. Users favor whole-number ratings, and 4 is the most common rating, which is typical of 5-star rating systems.

vRatings <- as.vector(mMatrix@data)   # all cells, with 0 marking "no rating"
vRatings <- vRatings[vRatings != 0]   # keep actual ratings only
ggplot() + aes(vRatings) + 
  geom_histogram(binwidth = 0.5) +
  xlab("Rating") + ylab("Number of Ratings")

The results show considerable sparsity, so I will keep only the users and movies with the most ratings to improve reliability.

movies <- mMatrix[rowCounts(mMatrix) > 50, colCounts(mMatrix) > 50]
movies
## 378 x 436 rating matrix of class 'realRatingMatrix' with 36214 ratings.

We are left with 378 users and 436 items: a smaller dataset with better-quality information. Many movies had only a handful of ratings.
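
A quick numeric check of the sparsity before and after filtering (sparsity() is a small helper defined here for illustration, not a recommenderlab function):

# Share of the user-item grid that is empty, before and after filtering
sparsity <- function(m) 1 - sum(rowCounts(m)) / (nrow(m) * ncol(m))
sparsity(mMatrix)  # ~0.983: about 98% of cells were empty
sparsity(movies)   # ~0.780: the filtered matrix is much denser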

Rating matrices always carry some user bias regardless of filtering, so let us plot the distribution of per-user average ratings to see the variation.

avg <- rowMeans(movies)
ggplot() + aes(avg) + 
  geom_histogram(binwidth = 0.1) +
  xlab("Average Rating") + ylab("Number of Users")

recommenderlab normalizes the data when building a model. However, I will normalize the data explicitly here to see how it transforms.

moviesNorm <- normalize(movies)
avg <- round(rowMeans(moviesNorm),5)
table(avg)
## avg
##   0 
## 378

Normalization centers each of the 378 rows at a mean of 0. Now let us compare the non-normalized and normalized datasets.

# Thresholds: top 5% most active users and top 5% most rated movies
minItems <- quantile(rowCounts(movies), 0.95)  # ratings per user
minUsers <- quantile(colCounts(movies), 0.95)  # ratings per movie

image(movies[rowCounts(movies) > minItems, 
             colCounts(movies) > minUsers], 
      main = "Heatmap of the Top Users and Movies (Non-Normalized)")

Now let us take a look at the normalized dataset.

image(moviesNorm[rowCounts(moviesNorm) > minItems, 
                 colCounts(moviesNorm) > minUsers], 
      main = "Heatmap of the Top Users and Movies (Normalized)")

Comparing rows across the two heatmaps above, we can see that after normalization the average rating is more uniform; visually, the user bias does appear to be reduced.
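
As a quick numeric check of the bias reduction, we can compare the spread of per-user means before and after normalization:

# Row centering should drive the second value to (near) zero
sd(rowMeans(movies))
sd(rowMeans(moviesNorm))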

Item-Item Collaborative Filtering

Let us split the data into an 80% training set and a 20% testing set.

set.seed(60)
# Randomly flag roughly 80% of users for training
wTrain <- sample(x = c(TRUE, FALSE), size = nrow(movies),
                 replace = TRUE, prob = c(0.8, 0.2))

movieTrain <- movies[wTrain, ]
movieTest <- movies[!wTrain, ]
movieTrain
## 297 x 436 rating matrix of class 'realRatingMatrix' with 29337 ratings.
movieTest
## 81 x 436 rating matrix of class 'realRatingMatrix' with 6877 ratings.

Training the Model

Now let us create an item-to-item model using the training set.

# Item-to-item model with Z-score normalization and Jaccard similarity
mIBCF <- Recommender(movieTrain, method = "IBCF",
                     parameter = list(normalize = "Z-score", method = "Jaccard"))
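
It can be useful to first peek at what the fitted model object stores; str() keeps the dump shallow:

# Shallow view of the fitted IBCF model's components
str(getModel(mIBCF), max.level = 1)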

We can examine the similarity matrix and find the ten movies with the most nonzero similarities to other movies.

simMat <- getModel(mIBCF)$sim                                # item-item similarity matrix
wMax <- order(colSums(simMat > 0), decreasing = TRUE)[1:10]  # most-connected items
topMovies <- as.data.frame(as.integer(rownames(simMat)[wMax]))
colnames(topMovies) <- c("movieId")
data <- topMovies %>% inner_join(titles, by = "movieId") %>% select(Movie = "title")
knitr::kable(data, format = "html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
Movie
Harry Potter and the Half-Blood Prince (2009)
Juno (2007)
Inglourious Basterds (2009)
Up (2009)
District 9 (2009)
Sherlock Holmes (2009)
Toy Story 3 (2010)
Inception (2010)
Social Network, The (2010)
The Hunger Games (2012)

Making Recommendations Using Test Set

pred <- predict(mIBCF, newdata = movieTest, n = 6) 
pred
## Recommendations as 'topNList' with n = 6 for 81 users.
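
As an aside, a topNList can be coerced to a plain list of recommended movieIds for quick inspection:

# Coerce the topNList to a plain list of movieId vectors
recList <- as(pred, "list")
recList[1]  # the six movieIds recommended for the first test user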

Now we can extract recommendations. Consider the first user.

# Movie ratings of the first user
user1 <- as.data.frame(movieTest@data[1,movieTest@data[1,]>0])
colnames(user1) <- c("Rating")
user1[c("movieId")] <- as.integer(rownames(user1))
data <- titles %>% 
  inner_join(user1, by = "movieId") %>% 
  select(Movie = "title", Rating) %>%
  arrange(desc(Rating))
knitr::kable(data, format = "html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
Movie Rating
Brazil (1985) 5.0
Taxi Driver (1976) 4.5
Blade Runner (1982) 4.5
Fargo (1996) 4.5
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964) 4.5
Full Metal Jacket (1987) 4.5
Chinatown (1974) 4.5
Memento (2000) 4.5
Spirited Away (Sen to Chihiro no kamikakushi) (2001) 4.5
Dark Knight, The (2008) 4.5
Toy Story 3 (2010) 4.5
Dark Knight Rises, The (2012) 4.5
Usual Suspects, The (1995) 4.0
Léon: The Professional (a.k.a. The Professional) (Léon) (1994) 4.0
Shawshank Redemption, The (1994) 4.0
Schindler’s List (1993) 4.0
Reservoir Dogs (1992) 4.0
Monty Python and the Holy Grail (1975) 4.0
Wallace & Gromit: The Wrong Trousers (1993) 4.0
One Flew Over the Cuckoo’s Nest (1975) 4.0
Princess Bride, The (1987) 4.0
12 Angry Men (1957) 4.0
To Kill a Mockingbird (1962) 4.0
Apocalypse Now (1979) 4.0
Alien (1979) 4.0
Annie Hall (1977) 4.0
Graduate, The (1967) 4.0
Cool Hand Luke (1967) 4.0
Requiem for a Dream (2000) 4.0
Amelie (Fabuleux destin d’Amélie Poulain, Le) (2001) 4.0
Pan’s Labyrinth (Laberinto del fauno, El) (2006) 4.0
WALL·E (2008) 4.0
Up (2009) 4.0
Seven (a.k.a. Se7en) (1995) 3.5
Forrest Gump (1994) 3.5
Rear Window (1954) 3.5
Casablanca (1942) 3.5
Citizen Kane (1941) 3.5
Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981) 3.5
Goodfellas (1990) 3.5
L.A. Confidential (1997) 3.5
Good Will Hunting (1997) 3.5
American History X (1998) 3.5
Matrix, The (1999) 3.5
Sixth Sense, The (1999) 3.5
American Beauty (1999) 3.5
Fight Club (1999) 3.5
Donnie Darko (2001) 3.5
Lord of the Rings: The Fellowship of the Ring, The (2001) 3.5
Lord of the Rings: The Two Towers, The (2002) 3.5
Lord of the Rings: The Return of the King, The (2003) 3.5
Eternal Sunshine of the Spotless Mind (2004) 3.5
Star Wars: Episode IV - A New Hope (1977) 3.0
Pulp Fiction (1994) 3.0
Silence of the Lambs, The (1991) 3.0
Star Wars: Episode V - The Empire Strikes Back (1980) 3.0
Good, the Bad and the Ugly, The (Buono, il brutto, il cattivo, Il) (1966) 3.0
Psycho (1960) 3.0
Snatch (2000) 3.0
Inception (2010) 3.0
Godfather, The (1972) 2.5

Recommendations are as follows:

# Recommendations for the first user
recommended <- pred@itemLabels[pred@items[[1]]]
recommended <- as.data.frame(as.integer(recommended))
colnames(recommended) <- c("movieId")
data <- recommended %>% inner_join(titles, by = "movieId") %>% select(Movie = "title")
knitr::kable(data, format = "html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
Movie
Stargate (1994)
Ace Ventura: Pet Detective (1994)
Lion King, The (1994)
Speed (1994)
Jurassic Park (1993)
Home Alone (1990)

User-User Collaborative Filtering

The setup and code for the user based collaborative filtering is very similar to the item-based collaborative filtering above.

Training the Model

Now let us create a user-based model using the training set

(mUBCF <- Recommender(movieTrain, method = "UBCF",
                      parameter = list(normalize = "Z-score", method = "Jaccard", nn = 5)))
## Recommender of type 'UBCF' for 'realRatingMatrix' 
## learned using 297 users.

Making Recommendations Using Test Set

( predUBCF <- predict(mUBCF, newdata = movieTest, n = 6) )
## Recommendations as 'topNList' with n = 6 for 81 users.

Let us consider the first user again and look at the recommendations.

# Recommendations for the first user
recommended <- predUBCF@itemLabels[predUBCF@items[[1]]]
recommended <- as.data.frame(as.integer(recommended))
colnames(recommended) <- c("movieId")
data <- recommended %>% inner_join(titles, by = "movieId") %>% select(Movie = "title")
knitr::kable(data, format = "html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
Movie
Fugitive, The (1993)
Casino Royale (2006)
Terminator, The (1984)
Back to the Future (1985)
Planet of the Apes (1968)
Jurassic Park (1993)

Normalization Test

Consider what happens if we build the model without normalizing the data. Comparing the two UBCF lists for the first user, only a couple of titles (The Fugitive and Jurassic Park) appear on both. I would argue that without normalization the recommendations include more of the usual suspects (movies rated very highly by the majority of users) and are therefore more generic. This is, of course, highly subjective and would need to be tested using more objective means than this writer's opinion.

mUBCF <- Recommender(movieTrain, method = "UBCF", parameter = list(normalize = NULL))
predUBCF <- predict(mUBCF, newdata = movieTest, n = 6)
recommended <- predUBCF@itemLabels[predUBCF@items[[1]]]
recommended <- as.data.frame(as.integer(recommended))
colnames(recommended) <- c("movieId")
data <- recommended %>% inner_join(titles, by = "movieId") %>% select(Movie = "title")
knitr::kable(data, format = "html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
Movie
Toy Story (1995)
Apollo 13 (1995)
True Lies (1994)
Fugitive, The (1993)
Jurassic Park (1993)
Speed (1994)

Evaluation

xval <- evaluationScheme(as(movies, "realRatingMatrix"), 
                         method = "split", 
                         train = 0.7,    # 70% of users go to the training split
                         given = 3,      # 3 ratings per test user are revealed ("known")
                         goodRating = 5) # a rating of 5 counts as a hit for topN metrics

p1 <- predict(mIBCF, getData(xval, "known"), type = "ratings")
p2 <- predict(mUBCF, getData(xval, "known"), type = "ratings")

err_eval <- rbind(
  IBCF = calcPredictionAccuracy(p1, getData(xval, "unknown")),
  UBCF = calcPredictionAccuracy(p2, getData(xval, "unknown")))

err_eval
##          RMSE      MSE       MAE
## IBCF 1.311181 1.719194 0.9735028
## UBCF 2.888410 8.342910 2.7012794

Note that of the two methods, the item-item recommender achieves the better (lower) RMSE.
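
One caveat: mIBCF and mUBCF were trained on movieTrain, which was sampled independently of this evaluation scheme, so some of the scheme's test users may have appeared in the models' training data. A stricter variant (a sketch; its numbers will differ from the table above) refits both models on the scheme's own training split:

# Stricter protocol: refit on the scheme's training split so no
# evaluated user is seen at training time
mIBCF2 <- Recommender(getData(xval, "train"), method = "IBCF",
                      parameter = list(normalize = "Z-score", method = "Jaccard"))
mUBCF2 <- Recommender(getData(xval, "train"), method = "UBCF",
                      parameter = list(normalize = "Z-score", method = "Jaccard", nn = 5))
rbind(
  IBCF = calcPredictionAccuracy(predict(mIBCF2, getData(xval, "known"), type = "ratings"),
                                getData(xval, "unknown")),
  UBCF = calcPredictionAccuracy(predict(mUBCF2, getData(xval, "known"), type = "ratings"),
                                getData(xval, "unknown")))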

sch <- evaluationScheme(as(movies, "realRatingMatrix"), 
                        method = "cross",
                        k = 4, # 4-fold cross-validation scheme
                        given = 3,
                        goodRating = 5)

# Next we use the created evaluation scheme to evaluate the
# IBCF and UBCF methods, looking at top-1, top-3, top-5,
# top-10, top-15, and top-20 recommendation lists.

# The timings printed below show how long each recommender
# takes to train and to serve up n recommendations:
results <- evaluate(sch, 
                    method = "IBCF", 
                    type = "topNList",
                    n = c(1, 3, 5, 10, 15, 20))
## IBCF run fold/sample [model time/prediction time]
##   1  [0.69sec/0.05sec] 
##   2  [0.81sec/0.02sec] 
##   3  [1.04sec/0.11sec] 
##   4  [0.77sec/0.06sec]
results2 <- evaluate(sch, 
                    method = "UBCF", 
                    type = "topNList",
                    n = c(1, 3, 5, 10, 15, 20))
## UBCF run fold/sample [model time/prediction time]
##   1  [0.01sec/0.24sec] 
##   2  [0sec/0.48sec] 
##   3  [0.02sec/0.44sec] 
##   4  [0.02sec/0.5sec]
getConfusionMatrix(results)[[1]]
##            TP         FP       FN       TN  precision      recall
## 1  0.02083333  0.8645833 17.65625 414.4583 0.02352941 0.001138320
## 3  0.10416667  2.5520833 17.57292 412.7708 0.03921569 0.004861445
## 5  0.16666667  4.2604167 17.51042 411.0625 0.03764706 0.008065325
## 10 0.38541667  8.4687500 17.29167 406.8542 0.04352941 0.023861859
## 15 0.56250000 12.6979167 17.11458 402.6250 0.04247360 0.035404768
## 20 0.75000000 16.8020833 16.92708 398.5208 0.04266968 0.048452398
##            TPR         FPR
## 1  0.001138320 0.002087706
## 3  0.004861445 0.006153853
## 5  0.008065325 0.010276550
## 10 0.023861859 0.020393640
## 15 0.035404768 0.030584597
## 20 0.048452398 0.040494314
plot(results, "prec/rec", annotate=TRUE)

getConfusionMatrix(results2)[[1]]
##           TP        FP       FN       TN precision     recall        TPR
## 1  0.2291667  0.656250 17.44792 414.6667 0.2588235 0.01356543 0.01356543
## 3  0.5625000  2.093750 17.11458 413.2292 0.2117647 0.04566944 0.04566944
## 5  0.8125000  3.614583 16.86458 411.7083 0.1835294 0.05871916 0.05871916
## 10 1.3229167  7.531250 16.35417 407.7917 0.1494118 0.09231985 0.09231985
## 15 1.8333333 11.447917 15.84375 403.8750 0.1380392 0.11983825 0.11983825
## 20 2.2916667 15.416667 15.38542 399.9062 0.1294118 0.14044068 0.14044068
##            FPR
## 1  0.001563910
## 3  0.005010628
## 5  0.008645852
## 10 0.018035956
## 15 0.027437936
## 20 0.036951540
plot(results2, "prec/rec", annotate=TRUE)
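
For a side-by-side comparison, recommenderlab can also evaluate a named list of algorithms in one call, putting both ROC curves on the same chart (a sketch reusing the scheme above):

# Evaluate both algorithms under the same scheme and plot ROC curves
algos <- list(
  IBCF = list(name = "IBCF", param = list(method = "Jaccard")),
  UBCF = list(name = "UBCF", param = list(method = "Jaccard", nn = 5))
)
resBoth <- evaluate(sch, algos, type = "topNList", n = c(1, 3, 5, 10, 15, 20))
plot(resBoth, annotate = c(1, 2), legend = "topleft")  # ROC: TPR vs FPR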

Summary

This project presented the most basic approach to building a recommender system in R using the recommenderlab package. It describes the essential steps; however, it is important to note that this is a learning exercise, and more development and testing would be needed for a usable recommender system. It would be particularly interesting to see the effect of additional features, and to measure the recommender's performance when little is known about a user's preferences.