In this project I am going to implement user-recommendation algorithms for Fantasy book ratings dataset I have collected from 18 people. The dataset is small so it is more of an example and it will be interesting to see how accurate of a recommendation I will be able to produce with that limitation. In this assignment I will explore a few tecniques for creating user recommendations.

1. Installing packages and loading our data:

options(warn=-1)
if(!"recommenderlab" %in% rownames(installed.packages())){
install.packages("recommenderlab")}
suppressMessages(library("ggplot2"))

suppressMessages(library("recommenderlab"))
suppressMessages(library(kableExtra))
df <- read.csv(file="https://raw.githubusercontent.com/che10vek/Data612/master/FantasyBookRatings18.csv", header=TRUE, stringsAsFactors = FALSE, sep=",")
head(df) %>% kable(caption = "Fantasy Book Ratings") %>% kable_styling("striped", full_width = TRUE)
Fantasy Book Ratings
X Wizard.s.First.Rule Harry.Potter..Series. Twilight..Series. Hitchhiker.s.Guide.to.the.Galaxy Jonathan.Strange.and.Mister.Norrell Master.and.Margarita Lord.of.the.Rings Song.of.Ice.and.Fire Hunger.Games
Lina 5 4 5 3 NA 4 NA NA 3
Lilya 4 5 5 4 5 3 NA 4 3
Mariya 4 NA 5 NA 2 5 2 NA 3
Yuliya NA 5 NA 4 2 5 3 1 3
Irene 4 5 4 4 3 4 NA 5 3
Alla 5 NA 5 4 NA 5 NA NA 4
df <- df[1:17,]

2. Investigating the data

Let’s inspect our data. We can see that even though the dataset is small - it has good data so there is no need to clean up anything besides the presence of NAs.

summary(df)%>% kable(caption = "Summary") %>% kable_styling("striped", full_width = TRUE)
Summary
  X </th>
Wizard.s.First.Rule Harry.Potter..Series. Twilight..Series. Hitchhiker.s.Guide.to.the.Galaxy Jonathan.Strange.and.Mister.Norrell Master.and.Margarita Lord.of.the.Rings Song.of.Ice.and.Fire Hunger.Games
Length:17 Min. :4.0 Min. :4.000 Min. :2.000 Min. :3.000 Min. :2.000 Min. :3.000 Min. :1.000 Min. :1.00 Min. :3.000
Class :character 1st Qu.:4.0 1st Qu.:4.000 1st Qu.:4.750 1st Qu.:4.000 1st Qu.:2.250 1st Qu.:4.000 1st Qu.:2.750 1st Qu.:4.00 1st Qu.:3.000
Mode :character Median :4.5 Median :5.000 Median :5.000 Median :4.000 Median :3.500 Median :4.000 Median :3.000 Median :4.00 Median :3.000
NA Mean :4.5 Mean :4.643 Mean :4.583 Mean :4.071 Mean :3.333 Mean :4.357 Mean :3.083 Mean :3.75 Mean :3.643
NA 3rd Qu.:5.0 3rd Qu.:5.000 3rd Qu.:5.000 3rd Qu.:4.750 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:3.250 3rd Qu.:4.25 3rd Qu.:4.000
NA Max. :5.0 Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.00 Max. :5.000
NA NA’s :3 NA’s :3 NA’s :5 NA’s :3 NA’s :11 NA’s :3 NA’s :5 NA’s :5 NA’s :3

Let’s create a plot of average book ratings with some data added from Goodreads.com

b = colMeans(df[sapply(df, is.numeric)],na.rm=TRUE) 

#Pulling in ratings from Goodreads.com for these 9 books
dfgr <- c(4.14, 4.75, 3.89, 4.22, 3.82, 4.31, 4.35, 4.45, 4.33)

qplot(names(df[2:10]), b, xlab = "Book", ylab = "Rating", color = "GoodReads") + theme(axis.text.x = element_text(angle = 45, hjust = 1))+ ggtitle("Average rating for each book") + geom_line(aes(y=dfgr),group=1)

We can see pretty similar results, with some exceptions such as Lord of the Rings, Hunger Games and Twilight. This can be explained by the fact that all the data was gathered among my friends and some bias due to similarity of taste is evident.

3. Predictions

The next step is to make some predictions using user-user and item-item colaborative filtering.

First let’s split our data into train and test set.

set.seed(11) #setting the seed to make results reproducible

which_train <- sample(x = c(TRUE, FALSE), size = nrow(df),
replace = TRUE, prob = c(0.8, 0.2))

data_train <- df[which_train, ]
data_test <- df[!which_train, ]
x<-data_test

First, let’s use IBCF model from “recommenderlab” library, this is a model that uses item-based collaborative filtering.

#Converting Train Data Frame to Real Rating Matrix
data_train <- as.matrix(data_train)
data_train <- as(data_train, "realRatingMatrix")
data_test <- as.matrix(data_test)
data_test <- as(data_test, "realRatingMatrix")

recc_model <- Recommender(data = data_train, method = "IBCF")

recc_predicted <- predict(object = recc_model, newdata = data_test, n = 2)
recc_predicted
## Recommendations as 'topNList' with n = 2 for 4 users.

Here are the Item-Item based recommendations for Users 1:

#Test Dataset
x %>% kable(caption = "Test Dataset") %>% kable_styling("striped", full_width = TRUE)
Test Dataset
X Wizard.s.First.Rule Harry.Potter..Series. Twilight..Series. Hitchhiker.s.Guide.to.the.Galaxy Jonathan.Strange.and.Mister.Norrell Master.and.Margarita Lord.of.the.Rings Song.of.Ice.and.Fire Hunger.Games
6 Alla 5 NA 5 4 NA 5 NA NA 4
9 Lana 5 5 5 4 NA 4 2 NA 4
13 Anya NA 4 5 3 NA 4 1 1 5
14 AndreyM 5 5 2 5 NA 5 5 4 3
#Predictions
recc_user_1 <- recc_predicted@items[[1]]
book_user_1 <- recc_predicted@itemLabels[recc_user_1]
book_user_1 %>% kable(caption = "User 1") %>% kable_styling("striped", full_width = FALSE)
User 1
x
Song.of.Ice.and.Fire
Harry.Potter..Series.

Now let’s use UBCF Model:

#Let's recreate test and train datasets
set.seed(14) 
which_train <- sample(x = c(TRUE, FALSE), size = nrow(df),
replace = TRUE, prob = c(0.8, 0.2))

data_train <- df[which_train, ]
data_test <- df[!which_train, ]
x<-data_test

#Converting Train Data Frame to Real Rating Matrix
data_train <- as.matrix(data_train)
data_train <- as(data_train, "realRatingMatrix")
data_test <- as.matrix(data_test)
data_test <- as(data_test, "realRatingMatrix")


recc_model <- Recommender(data = data_train, method = "UBCF")
recc_model
## Recommender of type 'UBCF' for 'realRatingMatrix' 
## learned using 11 users.
n_recommended <- 3
recc_predicted <- predict(object = recc_model, newdata = data_test, n = n_recommended) 
recc_predicted
## Recommendations as 'topNList' with n = 3 for 6 users.

Here are the User-User based recommendations for Users 2 and 3:

#Test Dataset
x %>% kable(caption = "Test Dataset") %>% kable_styling("striped", full_width = TRUE)
Test Dataset
X Wizard.s.First.Rule Harry.Potter..Series. Twilight..Series. Hitchhiker.s.Guide.to.the.Galaxy Jonathan.Strange.and.Mister.Norrell Master.and.Margarita Lord.of.the.Rings Song.of.Ice.and.Fire Hunger.Games
3 Mariya 4 NA 5 NA 2 5 2 NA 3
5 Irene 4 5 4 4 3 4 NA 5 3
7 Chris 4 4 NA 5 NA NA 5 4 NA
11 Ronny NA 5 5 NA NA NA 3 4 5
14 AndreyM 5 5 2 5 NA 5 5 4 3
15 Sally 5 5 5 5 NA 5 NA 5 3
#Predictions

recc_user_2 <- recc_predicted@items[[2]]
book_user_2 <- recc_predicted@itemLabels[recc_user_2]
book_user_2 %>% kable(caption = "User 2") %>% kable_styling("striped", full_width = FALSE)
User 2
x
Lord.of.the.Rings
recc_user_2 <- recc_predicted@items[[3]]
book_user_2 <- recc_predicted@itemLabels[recc_user_2]
book_user_2 %>% kable(caption = "User 3") %>% kable_styling("striped", full_width = FALSE)
User 3
x
Master.and.Margarita
Twilight..Series.
Hunger.Games

4. Conclusion

Both UBCF with IBCF results appear to be accurate - they both recommend books which users have not preiously read and books that have a high rating. UBCF is agreed to be more accurate for small dataset - so I will conclude that UBCF recommendation are better for my purposes. This project can be further expanded by adding users and increasing the scope beyond fantasy books. It would be interested to see how adding ‘tag’ type data can improve recommendations.