Find an interesting dataset and describe the system you plan to build out. If you would like to use one of the datasets you have already worked with, you should add a unique element or incorporate additional data (e.g., explicit features you scrape from another source, such as image analysis on movie posters). The overall goal, however, will be to produce quality recommendations by extracting insights from a large dataset. You may do so using Spark or another distributed computing method, or by effectively applying one of the more advanced mathematical techniques we have covered.

#load required packages
library(ggplot2)
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ tibble  2.0.1     ✔ purrr   0.3.0
## ✔ tidyr   0.8.3     ✔ dplyr   0.7.8
## ✔ readr   1.3.1     ✔ stringr 1.3.1
## ✔ tibble  2.0.1     ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(kableExtra)
library(knitr)
library(recommenderlab)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
## 
##     expand
## Loading required package: arules
## 
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
## Loading required package: proxy
## 
## Attaching package: 'proxy'
## The following object is masked from 'package:Matrix':
## 
##     as.matrix
## The following objects are masked from 'package:stats':
## 
##     as.dist, dist
## The following object is masked from 'package:base':
## 
##     as.matrix
## Loading required package: registry
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following objects are masked from 'package:data.table':
## 
##     dcast, melt
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(dplyr)
library(stringr)
library(Matrix)

Data Set Up

I will be using the Book-Crossing dataset [1]. This dataset contains book ratings, user IDs (along with user age and user location), and book IDs (ISBNs, along with author, title, and year published). Explicit ratings are on a scale of 1-10; a rating of 0 marks implicit feedback rather than a low score.

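#import data
#the extra arguments belong in read.csv(), not as.data.frame(), where they were silently ignored
#skip is dropped: skipping rows would discard the header that names the column split below
#read.csv() splits on commas, so the semicolon-delimited fields stay merged and are split below
users <- read.csv("BX-Users.csv", stringsAsFactors = FALSE, fileEncoding = "latin1")
books <- read.csv("BX-Books.csv", stringsAsFactors = FALSE, fileEncoding = "latin1")
ratings <- read.csv("BX-Book-Ratings.csv", stringsAsFactors = FALSE, fileEncoding = "latin1")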
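As a point of comparison, the fields can also be separated at read time instead of afterwards. The sketch below assumes the files are semicolon-delimited with quoted fields and use NULL for missing values, as the merged column names and the age column in this report suggest. The inline rows and the name users_demo are made up for illustration.

```r
# Made-up rows in the apparent layout of the users file (not real data).
raw_text <- paste(
  '"User-ID";"Location";"Age"',
  '"1";"springfield, state, country";NULL',
  '"2";"shelbyville, state, country";"18"',
  sep = "\n"
)

# sep = ";" splits the fields while reading; "NULL" is mapped to NA.
# check.names = FALSE keeps the header "User-ID" as written.
users_demo <- read.csv(
  text = raw_text,
  sep = ";",
  na.strings = "NULL",
  stringsAsFactors = FALSE,
  check.names = FALSE
)

str(users_demo)
```

With this approach the commas inside the location field stay intact, because only semicolons act as delimiters.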

#clean up data
users$User.ID.Location.Age <- str_remove(users$User.ID.Location.Age, "[<>]")
u <- str_split_fixed(users$User.ID.Location.Age, "\\;+", 3)
colnames(u) <- c("userid", "location", "age")

b <- str_split_fixed(books$ISBN.Book.Title.Book.Author.Year.Of.Publication.Publisher.Image.URL.S.Image.URL.M.Image.URL.L, "\\;+", 9)
colnames(b) <- c("isbn", "title", "author", "year", "publisher", "image urls", "image urlm", "image urll", "image urln")

r <- str_split_fixed(ratings$User.ID.ISBN.Book.Rating, "\\;+", 3)
colnames(r) <- c("userid", "isbn", "rating")

#merge matrices
bookratingsfull <- merge(u, r, by = "userid")
bookratingsfull <- merge(bookratingsfull, b, by = "isbn")
head(bookratingsfull)
##         isbn userid                              location  age rating
## 1 000160418X 224654       watton, england, united kingdom NULL      0
## 2 0001837397 225986      watford, england, united kingdom   36      0
## 3 000195833X 243930  montreuil, seine saint-denis, france   43      0
## 4 0001981307 166188       london, england, united kingdom   43      0
## 5 0001981323 244602 chesterfield, england, united kingdom NULL     10
## 6 0002005018      8              timmins, ontario, canada NULL      5
##                                   title               author year
## 1        The Clue in the Crumbling Wall        Carolyn Keene 1984
## 2            Autumn Story Brambly Hedge         Jill Barklem    0
## 3 The Earth and the Sky (Tell Me About)       Pierre Averous 1982
## 4       The Cross Rabbit (Percy's Park)     Nick Butterworth 1994
## 5      The Badger's Bath (Percy's Park)     Nick Butterworth 1996
## 6                          Clara Callan Richard Bruce Wright 2001
##                     publisher
## 1    HarperCollins Publishers
## 2 William Collins Sons Co Ltd
## 3    HarperCollins Publishers
## 4    HarperCollins Publishers
## 5    HarperCollins Publishers
## 6       HarperFlamingo Canada
##                                                     image urls
## 1 http://images.amazon.com/images/P/000160418X.01.THUMBZZZ.jpg
## 2 http://images.amazon.com/images/P/0001837397.01.THUMBZZZ.jpg
## 3 http://images.amazon.com/images/P/000195833X.01.THUMBZZZ.jpg
## 4 http://images.amazon.com/images/P/0001981307.01.THUMBZZZ.jpg
## 5 http://images.amazon.com/images/P/0001981323.01.THUMBZZZ.jpg
## 6 http://images.amazon.com/images/P/0002005018.01.THUMBZZZ.jpg
##                                                     image urlm
## 1 http://images.amazon.com/images/P/000160418X.01.MZZZZZZZ.jpg
## 2 http://images.amazon.com/images/P/0001837397.01.MZZZZZZZ.jpg
## 3 http://images.amazon.com/images/P/000195833X.01.MZZZZZZZ.jpg
## 4 http://images.amazon.com/images/P/0001981307.01.MZZZZZZZ.jpg
## 5 http://images.amazon.com/images/P/0001981323.01.MZZZZZZZ.jpg
## 6 http://images.amazon.com/images/P/0002005018.01.MZZZZZZZ.jpg
##                                                     image urll image urln
## 1 http://images.amazon.com/images/P/000160418X.01.LZZZZZZZ.jpg           
## 2 http://images.amazon.com/images/P/0001837397.01.LZZZZZZZ.jpg           
## 3 http://images.amazon.com/images/P/000195833X.01.LZZZZZZZ.jpg           
## 4 http://images.amazon.com/images/P/0001981307.01.LZZZZZZZ.jpg           
## 5 http://images.amazon.com/images/P/0001981323.01.LZZZZZZZ.jpg           
## 6 http://images.amazon.com/images/P/0002005018.01.LZZZZZZZ.jpg
bookratingsfull$location <- as.character(bookratingsfull$location)

#remove image url columns and location column for easier analysis
bookratingsfull <- bookratingsfull[,-c(10:13)]
bookratingsfull <- bookratingsfull[,-3]

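#convert to a realRatingMatrix; as.matrix() on a data frame with text columns gives a character matrix, so non-numeric values become NA in the coercion below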
bookrm <- as.matrix(bookratingsfull)
bookrm <- as(bookrm, "dgCMatrix")
## Warning in storage.mode(from) <- "double": NAs introduced by coercion
bookrm <- as(bookrm, "realRatingMatrix")

#normalize data
booknorm <- normalize(bookrm, method="center", row=TRUE)
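Recommender models usually expect one row per user and one column per item. The following is a minimal sketch of building such a matrix from (user, item, rating) triples with the Matrix package and of row-centering the observed ratings. The data frame ratings_demo and all of its values are hypothetical.

```r
library(Matrix)

# Hypothetical long-format ratings: one row per (user, item, rating) triple.
ratings_demo <- data.frame(
  userid = c("u1", "u1", "u2", "u3", "u3"),
  isbn   = c("A", "B", "A", "B", "C"),
  rating = c(8, 5, 10, 7, 9),
  stringsAsFactors = FALSE
)

# Map user and item labels to integer row and column positions.
user_f <- factor(ratings_demo$userid)
item_f <- factor(ratings_demo$isbn)

# Sparse users x items matrix; cells without a rating stay empty.
um <- sparseMatrix(
  i = as.integer(user_f),
  j = as.integer(item_f),
  x = ratings_demo$rating,
  dimnames = list(levels(user_f), levels(item_f))
)
dim(um)

# Row-centering: subtract each user's mean from that user's observed ratings.
user_means <- tapply(ratings_demo$rating, ratings_demo$userid, mean)
ratings_demo$centered <- ratings_demo$rating -
  as.numeric(user_means[ratings_demo$userid])
ratings_demo
```

A sparse matrix of this shape is the kind of object that the as(..., "realRatingMatrix") step above takes as input.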

Project Idea

I will build a recommender system for the books in the following steps.

Split data into training and test sets

#create train and test set
train <- bookrm[1:100,]
test <- bookrm[200:204,]
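The split above takes rows by position. A random split is one alternative. Here is a minimal base R sketch in which n_users and the 80/20 proportion are illustrative assumptions:

```r
# Illustrative size only; in practice this would be the number of rows
# in the rating matrix.
n_users <- 10

# Fix the seed so the same rows are drawn on every run.
set.seed(1444)

# Draw 80% of the row indices for training; the rest form the test set.
train_idx <- sample(n_users, size = floor(0.8 * n_users))
test_idx  <- setdiff(seq_len(n_users), train_idx)

length(train_idx)
length(test_idx)
```

The index vectors could then be used to subset the rating matrix by row, in the same way the positional ranges are used above.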

Apply the recommender system to the dataset

#set the seed before creating the scheme so the random split is reproducible
set.seed(1444)
scheme <- evaluationScheme(bookrm, method = "split",
                           train = 0.8,
                           given = 1,
                           goodRating = 8)

evaluation_models <-list(POP = list(name='POPULAR', 
                         param = list(
                            normalize = 'z-score')),
                         RAN = list(name='RANDOM', 
                         param = list( 
                           normalize = 'center')))

list_results <- evaluate(x =scheme, 
                    method = evaluation_models,
                    n = seq(10, 11))
## POPULAR run fold/sample [model time/prediction time]
##   1  [7.339sec/68.897sec] 
## RANDOM run fold/sample [model time/prediction time]
##   1  [0.009sec/5.499sec]
average_matrices <- lapply(list_results, avg)
average_matrices
## $POP
##          TP        FP FN TN precision recall TPR FPR
## 10 6.223735 0.7762647  0  0  0.889105      1   1   1
## 11 6.223735 0.7762647  0  0  0.889105      1   1   1
## 
## $RAN
##          TP        FP FN TN precision recall TPR FPR
## 10 6.223735 0.7762647  0  0  0.889105      1   1   1
## 11 6.223735 0.7762647  0  0  0.889105      1   1   1

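Both models report the same averaged metrics: a precision of about 0.89 and a true positive rate (recall) of 1. The false positive rate is also 1, TN and FN are both 0, and TP + FP is only about 7 although 10 and 11 recommendations were requested. Together these suggest that every remaining item was recommended to every user, so these figures do not distinguish the popular model from the random one. Next, I will create a hybrid recommender system. The aim is to introduce serendipitous suggestions, ideally books the users have not heard of but will like.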
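The rates in the table follow from the averaged confusion counts using the standard definitions. A short check with the values copied from the n = 10 row (the variable names are only for illustration):

```r
# Averaged counts copied from the n = 10 row of the output above.
tp <- 6.223735
fp <- 0.7762647
fn <- 0
tn <- 0

# Standard definitions of the reported rates.
precision <- tp / (tp + fp)  # share of recommended items that are relevant
recall    <- tp / (tp + fn)  # share of relevant items that were recommended (TPR)
fpr       <- fp / (fp + tn)  # share of non-relevant items that were recommended

round(c(precision = precision, recall = recall, FPR = fpr), 6)

# Number of items actually recommended per user, on average.
tp + fp
```

The last value is about 7, which is fewer than the 10 recommendations requested.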

## mix popular books with random recommendations for diversity

recom <- HybridRecommender(
Recommender(train, method = "POPULAR"),
Recommender(train, method = "RANDOM"),
weights = c(.6, .4)
)

recom
## Recommender of type 'HYBRID' for 'ratingMatrix' 
## learned using NA users.
getModel(recom)
## $recommender
## $recommender[[1]]
## Recommender of type 'POPULAR' for 'realRatingMatrix' 
## learned using 100 users.
## 
## $recommender[[2]]
## Recommender of type 'RANDOM' for 'realRatingMatrix' 
## learned using 100 users.
## 
## 
## $weights
## [1] 0.6 0.4
pred <- predict(recom, test)
#store the list under its own name so the predict() function is not shadowed
pred_list <- as(predict(recom, test), "list")
pred_list
## [[1]]
## character(0)
## 
## [[2]]
## character(0)
## 
## [[3]]
## [1] "rating"
## 
## [[4]]
## [1] "rating"
## 
## [[5]]
## [1] "rating"

Compare accuracy across datasets and algorithms

## check error
err <-calcPredictionAccuracy(pred, test, given = 50, goodRating = "7")
err
##           TP           FP           FN           TN    precision 
##   0.00000000   0.60000000   3.80000000 -46.40000000   0.00000000 
##       recall          TPR          FPR 
##   0.00000000   0.00000000  -0.01304759

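The precision and the true positive rate are now both 0. However, TN and FPR are negative, which cannot happen for valid counts and rates. A likely cause is that given = 50 is larger than the number of item columns in the matrix; goodRating is also passed as the string "7" rather than a number. The result is therefore not reliable evidence that combining the methods weakened the model, and the evaluation setup should be corrected before the models are compared.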
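The TN value in the output above can be reproduced by hand under two assumptions that the source does not confirm: that TN is derived as items - given - TP - FP - FN, and that the matrix has eight item columns (the thirteen merged columns minus the five removed earlier).

```r
# Assumption: the eight remaining data-frame columns act as the items.
n_items <- 8

# given as passed to calcPredictionAccuracy() above.
given <- 50

# Averaged counts copied from the output above.
tp <- 0
fp <- 0.6
fn <- 3.8

# Assumed derivation: items left after removing the given ones and the
# other three confusion cells.
tn <- n_items - given - tp - fp - fn
tn
```

The result matches the reported TN, which is consistent with given exceeding the number of items.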

Acknowledgements

[1] Improving Recommendation Lists Through Topic Diversification, Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, Georg Lausen; Proceedings of the 14th International World Wide Web Conference (WWW ’05), May 10-14, 2005, Chiba, Japan. To appear.