In this project I am going to implement user-recommendation algorithms for Fantasy book ratings dataset I have collected from 18 people. The dataset is small so it is more of an example and it will be interesting to see how accurate of a recommendation I will be able to produce with that limitation. In this assignment I will explore a few tecniques for creating user recommendations.
options(warn=-1)
if(!"recommenderlab" %in% rownames(installed.packages())){
install.packages("recommenderlab")}
suppressMessages(library("ggplot2"))
suppressMessages(library("recommenderlab"))
suppressMessages(library(kableExtra))
df <- read.csv(file="https://raw.githubusercontent.com/che10vek/Data612/master/FantasyBookRatings18.csv", header=TRUE, stringsAsFactors = FALSE, sep=",")
head(df) %>% kable(caption = "Fantasy Book Ratings") %>% kable_styling("striped", full_width = TRUE)
X | Wizard.s.First.Rule | Harry.Potter..Series. | Twilight..Series. | Hitchhiker.s.Guide.to.the.Galaxy | Jonathan.Strange.and.Mister.Norrell | Master.and.Margarita | Lord.of.the.Rings | Song.of.Ice.and.Fire | Hunger.Games |
---|---|---|---|---|---|---|---|---|---|
Lina | 5 | 4 | 5 | 3 | NA | 4 | NA | NA | 3 |
Lilya | 4 | 5 | 5 | 4 | 5 | 3 | NA | 4 | 3 |
Mariya | 4 | NA | 5 | NA | 2 | 5 | 2 | NA | 3 |
Yuliya | NA | 5 | NA | 4 | 2 | 5 | 3 | 1 | 3 |
Irene | 4 | 5 | 4 | 4 | 3 | 4 | NA | 5 | 3 |
Alla | 5 | NA | 5 | 4 | NA | 5 | NA | NA | 4 |
df <- df[1:17,]
Let’s inspect our data. We can see that even though the dataset is small - it has good data so there is no need to clean up anything besides the presence of NAs.
summary(df)%>% kable(caption = "Summary") %>% kable_styling("striped", full_width = TRUE)
| Wizard.s.First.Rule | Harry.Potter..Series. | Twilight..Series. | Hitchhiker.s.Guide.to.the.Galaxy | Jonathan.Strange.and.Mister.Norrell | Master.and.Margarita | Lord.of.the.Rings | Song.of.Ice.and.Fire | Hunger.Games | |
---|---|---|---|---|---|---|---|---|---|---|
Length:17 | Min. :4.0 | Min. :4.000 | Min. :2.000 | Min. :3.000 | Min. :2.000 | Min. :3.000 | Min. :1.000 | Min. :1.00 | Min. :3.000 | |
Class :character | 1st Qu.:4.0 | 1st Qu.:4.000 | 1st Qu.:4.750 | 1st Qu.:4.000 | 1st Qu.:2.250 | 1st Qu.:4.000 | 1st Qu.:2.750 | 1st Qu.:4.00 | 1st Qu.:3.000 | |
Mode :character | Median :4.5 | Median :5.000 | Median :5.000 | Median :4.000 | Median :3.500 | Median :4.000 | Median :3.000 | Median :4.00 | Median :3.000 | |
NA | Mean :4.5 | Mean :4.643 | Mean :4.583 | Mean :4.071 | Mean :3.333 | Mean :4.357 | Mean :3.083 | Mean :3.75 | Mean :3.643 | |
NA | 3rd Qu.:5.0 | 3rd Qu.:5.000 | 3rd Qu.:5.000 | 3rd Qu.:4.750 | 3rd Qu.:4.000 | 3rd Qu.:5.000 | 3rd Qu.:3.250 | 3rd Qu.:4.25 | 3rd Qu.:4.000 | |
NA | Max. :5.0 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.00 | Max. :5.000 | |
NA | NA’s :3 | NA’s :3 | NA’s :5 | NA’s :3 | NA’s :11 | NA’s :3 | NA’s :5 | NA’s :5 | NA’s :3 |
Let’s create a plot of average book ratings with some data added from Goodreads.com
b = colMeans(df[sapply(df, is.numeric)],na.rm=TRUE)
#Pulling in ratings from Goodreads.com for these 9 books
dfgr <- c(4.14, 4.75, 3.89, 4.22, 3.82, 4.31, 4.35, 4.45, 4.33)
qplot(names(df[2:10]), b, xlab = "Book", ylab = "Rating", color = "GoodReads") + theme(axis.text.x = element_text(angle = 45, hjust = 1))+ ggtitle("Average rating for each book") + geom_line(aes(y=dfgr),group=1)
We can see pretty similar results, with some exceptions such as Lord of the Rings, Hunger Games and Twilight. This can be explained by the fact that all the data was gathered among my friends and some bias due to similarity of taste is evident.
The next step is to make some predictions using user-user and item-item colaborative filtering.
First let’s split our data into train and test set.
set.seed(11) #setting the seed to make results reproducible
which_train <- sample(x = c(TRUE, FALSE), size = nrow(df),
replace = TRUE, prob = c(0.8, 0.2))
data_train <- df[which_train, ]
data_test <- df[!which_train, ]
x<-data_test
First, let’s use IBCF model from “recommenderlab” library, this is a model that uses item-based collaborative filtering.
#Converting Train Data Frame to Real Rating Matrix
data_train <- as.matrix(data_train)
data_train <- as(data_train, "realRatingMatrix")
data_test <- as.matrix(data_test)
data_test <- as(data_test, "realRatingMatrix")
recc_model <- Recommender(data = data_train, method = "IBCF")
recc_predicted <- predict(object = recc_model, newdata = data_test, n = 2)
recc_predicted
## Recommendations as 'topNList' with n = 2 for 4 users.
Here are the Item-Item based recommendations for Users 1:
#Test Dataset
x %>% kable(caption = "Test Dataset") %>% kable_styling("striped", full_width = TRUE)
X | Wizard.s.First.Rule | Harry.Potter..Series. | Twilight..Series. | Hitchhiker.s.Guide.to.the.Galaxy | Jonathan.Strange.and.Mister.Norrell | Master.and.Margarita | Lord.of.the.Rings | Song.of.Ice.and.Fire | Hunger.Games | |
---|---|---|---|---|---|---|---|---|---|---|
6 | Alla | 5 | NA | 5 | 4 | NA | 5 | NA | NA | 4 |
9 | Lana | 5 | 5 | 5 | 4 | NA | 4 | 2 | NA | 4 |
13 | Anya | NA | 4 | 5 | 3 | NA | 4 | 1 | 1 | 5 |
14 | AndreyM | 5 | 5 | 2 | 5 | NA | 5 | 5 | 4 | 3 |
#Predictions
recc_user_1 <- recc_predicted@items[[1]]
book_user_1 <- recc_predicted@itemLabels[recc_user_1]
book_user_1 %>% kable(caption = "User 1") %>% kable_styling("striped", full_width = FALSE)
x |
---|
Song.of.Ice.and.Fire |
Harry.Potter..Series. |
Now let’s use UBCF Model:
#Let's recreate test and train datasets
set.seed(14)
which_train <- sample(x = c(TRUE, FALSE), size = nrow(df),
replace = TRUE, prob = c(0.8, 0.2))
data_train <- df[which_train, ]
data_test <- df[!which_train, ]
x<-data_test
#Converting Train Data Frame to Real Rating Matrix
data_train <- as.matrix(data_train)
data_train <- as(data_train, "realRatingMatrix")
data_test <- as.matrix(data_test)
data_test <- as(data_test, "realRatingMatrix")
recc_model <- Recommender(data = data_train, method = "UBCF")
recc_model
## Recommender of type 'UBCF' for 'realRatingMatrix'
## learned using 11 users.
n_recommended <- 3
recc_predicted <- predict(object = recc_model, newdata = data_test, n = n_recommended)
recc_predicted
## Recommendations as 'topNList' with n = 3 for 6 users.
Here are the User-User based recommendations for Users 2 and 3:
#Test Dataset
x %>% kable(caption = "Test Dataset") %>% kable_styling("striped", full_width = TRUE)
X | Wizard.s.First.Rule | Harry.Potter..Series. | Twilight..Series. | Hitchhiker.s.Guide.to.the.Galaxy | Jonathan.Strange.and.Mister.Norrell | Master.and.Margarita | Lord.of.the.Rings | Song.of.Ice.and.Fire | Hunger.Games | |
---|---|---|---|---|---|---|---|---|---|---|
3 | Mariya | 4 | NA | 5 | NA | 2 | 5 | 2 | NA | 3 |
5 | Irene | 4 | 5 | 4 | 4 | 3 | 4 | NA | 5 | 3 |
7 | Chris | 4 | 4 | NA | 5 | NA | NA | 5 | 4 | NA |
11 | Ronny | NA | 5 | 5 | NA | NA | NA | 3 | 4 | 5 |
14 | AndreyM | 5 | 5 | 2 | 5 | NA | 5 | 5 | 4 | 3 |
15 | Sally | 5 | 5 | 5 | 5 | NA | 5 | NA | 5 | 3 |
#Predictions
recc_user_2 <- recc_predicted@items[[2]]
book_user_2 <- recc_predicted@itemLabels[recc_user_2]
book_user_2 %>% kable(caption = "User 2") %>% kable_styling("striped", full_width = FALSE)
x |
---|
Lord.of.the.Rings |
recc_user_2 <- recc_predicted@items[[3]]
book_user_2 <- recc_predicted@itemLabels[recc_user_2]
book_user_2 %>% kable(caption = "User 3") %>% kable_styling("striped", full_width = FALSE)
x |
---|
Master.and.Margarita |
Twilight..Series. |
Hunger.Games |
Both UBCF with IBCF results appear to be accurate - they both recommend books which users have not preiously read and books that have a high rating. UBCF is agreed to be more accurate for small dataset - so I will conclude that UBCF recommendation are better for my purposes. This project can be further expanded by adding users and increasing the scope beyond fantasy books. It would be interested to see how adding ‘tag’ type data can improve recommendations.