Elina Azrilyan - Data 612

1. Installing packages and loading our data:
2. Investigating the data
3. Predictions
4. Conclusion

In this project I am going to implement user-recommendation algorithms for Fantasy book ratings dataset I have collected from 18 people. The dataset is small so it is more of an example and it will be interesting to see how accurate of a recommendation I will be able to produce with that limitation. In this assignment I will explore a few tecniques for creating user recommendations.

1. Installing packages and loading our data:

options(warn=-1)
if(!"recommenderlab" %in% rownames(installed.packages())){
install.packages("recommenderlab")}
suppressMessages(library("ggplot2"))

suppressMessages(library("recommenderlab"))
suppressMessages(library(kableExtra))
df <- read.csv(file="https://raw.githubusercontent.com/che10vek/Data612/master/FantasyBookRatings18.csv", header=TRUE, stringsAsFactors = FALSE, sep=",")
head(df) %>% kable(caption = "Fantasy Book Ratings") %>% kable_styling("striped", full_width = TRUE)

Fantasy Book Ratings
X	Wizard.s.First.Rule	Harry.Potter..Series.	Twilight..Series.	Hitchhiker.s.Guide.to.the.Galaxy	Jonathan.Strange.and.Mister.Norrell	Master.and.Margarita	Lord.of.the.Rings	Song.of.Ice.and.Fire	Hunger.Games
Lina	5	4	5	3	NA	4	NA	NA	3
Lilya	4	5	5	4	5	3	NA	4	3
Mariya	4	NA	5	NA	2	5	2	NA	3
Yuliya	NA	5	NA	4	2	5	3	1	3
Irene	4	5	4	4	3	4	NA	5	3
Alla	5	NA	5	4	NA	5	NA	NA	4

df <- df[1:17,]

2. Investigating the data

Let’s inspect our data. We can see that even though the dataset is small - it has good data so there is no need to clean up anything besides the presence of NAs.

summary(df)%>% kable(caption = "Summary") %>% kable_styling("striped", full_width = TRUE)

Summary
`X </th>`	Wizard.s.First.Rule	Harry.Potter..Series.	Twilight..Series.	Hitchhiker.s.Guide.to.the.Galaxy	Jonathan.Strange.and.Mister.Norrell	Master.and.Margarita	Lord.of.the.Rings	Song.of.Ice.and.Fire	Hunger.Games
Length:17	Min. :4.0	Min. :4.000	Min. :2.000	Min. :3.000	Min. :2.000	Min. :3.000	Min. :1.000	Min. :1.00	Min. :3.000
Class :character	1st Qu.:4.0	1st Qu.:4.000	1st Qu.:4.750	1st Qu.:4.000	1st Qu.:2.250	1st Qu.:4.000	1st Qu.:2.750	1st Qu.:4.00	1st Qu.:3.000
Mode :character	Median :4.5	Median :5.000	Median :5.000	Median :4.000	Median :3.500	Median :4.000	Median :3.000	Median :4.00	Median :3.000
NA	Mean :4.5	Mean :4.643	Mean :4.583	Mean :4.071	Mean :3.333	Mean :4.357	Mean :3.083	Mean :3.75	Mean :3.643
NA	3rd Qu.:5.0	3rd Qu.:5.000	3rd Qu.:5.000	3rd Qu.:4.750	3rd Qu.:4.000	3rd Qu.:5.000	3rd Qu.:3.250	3rd Qu.:4.25	3rd Qu.:4.000
NA	Max. :5.0	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.00	Max. :5.000
NA	NA’s :3	NA’s :3	NA’s :5	NA’s :3	NA’s :11	NA’s :3	NA’s :5	NA’s :5	NA’s :3

Let’s create a plot of average book ratings with some data added from Goodreads.com

b = colMeans(df[sapply(df, is.numeric)],na.rm=TRUE) 

#Pulling in ratings from Goodreads.com for these 9 books
dfgr <- c(4.14, 4.75, 3.89, 4.22, 3.82, 4.31, 4.35, 4.45, 4.33)

qplot(names(df[2:10]), b, xlab = "Book", ylab = "Rating", color = "GoodReads") + theme(axis.text.x = element_text(angle = 45, hjust = 1))+ ggtitle("Average rating for each book") + geom_line(aes(y=dfgr),group=1)

We can see pretty similar results, with some exceptions such as Lord of the Rings, Hunger Games and Twilight. This can be explained by the fact that all the data was gathered among my friends and some bias due to similarity of taste is evident.

3. Predictions

The next step is to make some predictions using user-user and item-item colaborative filtering.

First let’s split our data into train and test set.

set.seed(11) #setting the seed to make results reproducible

which_train <- sample(x = c(TRUE, FALSE), size = nrow(df),
replace = TRUE, prob = c(0.8, 0.2))

data_train <- df[which_train, ]
data_test <- df[!which_train, ]
x<-data_test

First, let’s use IBCF model from “recommenderlab” library, this is a model that uses item-based collaborative filtering.

#Converting Train Data Frame to Real Rating Matrix
data_train <- as.matrix(data_train)
data_train <- as(data_train, "realRatingMatrix")
data_test <- as.matrix(data_test)
data_test <- as(data_test, "realRatingMatrix")

recc_model <- Recommender(data = data_train, method = "IBCF")

recc_predicted <- predict(object = recc_model, newdata = data_test, n = 2)
recc_predicted

## Recommendations as 'topNList' with n = 2 for 4 users.

Here are the Item-Item based recommendations for Users 1:

#Test Dataset
x %>% kable(caption = "Test Dataset") %>% kable_styling("striped", full_width = TRUE)

Test Dataset
	X	Wizard.s.First.Rule	Harry.Potter..Series.	Twilight..Series.	Hitchhiker.s.Guide.to.the.Galaxy	Jonathan.Strange.and.Mister.Norrell	Master.and.Margarita	Lord.of.the.Rings	Song.of.Ice.and.Fire	Hunger.Games
6	Alla	5	NA	5	4	NA	5	NA	NA	4
9	Lana	5	5	5	4	NA	4	2	NA	4
13	Anya	NA	4	5	3	NA	4	1	1	5
14	AndreyM	5	5	2	5	NA	5	5	4	3

#Predictions
recc_user_1 <- recc_predicted@items[[1]]
book_user_1 <- recc_predicted@itemLabels[recc_user_1]
book_user_1 %>% kable(caption = "User 1") %>% kable_styling("striped", full_width = FALSE)

User 1
x
Song.of.Ice.and.Fire
Harry.Potter..Series.

Now let’s use UBCF Model:

#Let's recreate test and train datasets
set.seed(14) 
which_train <- sample(x = c(TRUE, FALSE), size = nrow(df),
replace = TRUE, prob = c(0.8, 0.2))

data_train <- df[which_train, ]
data_test <- df[!which_train, ]
x<-data_test

#Converting Train Data Frame to Real Rating Matrix
data_train <- as.matrix(data_train)
data_train <- as(data_train, "realRatingMatrix")
data_test <- as.matrix(data_test)
data_test <- as(data_test, "realRatingMatrix")


recc_model <- Recommender(data = data_train, method = "UBCF")
recc_model

## Recommender of type 'UBCF' for 'realRatingMatrix' 
## learned using 11 users.

n_recommended <- 3
recc_predicted <- predict(object = recc_model, newdata = data_test, n = n_recommended) 
recc_predicted

## Recommendations as 'topNList' with n = 3 for 6 users.

Here are the User-User based recommendations for Users 2 and 3:

#Test Dataset
x %>% kable(caption = "Test Dataset") %>% kable_styling("striped", full_width = TRUE)

Test Dataset
	X	Wizard.s.First.Rule	Harry.Potter..Series.	Twilight..Series.	Hitchhiker.s.Guide.to.the.Galaxy	Jonathan.Strange.and.Mister.Norrell	Master.and.Margarita	Lord.of.the.Rings	Song.of.Ice.and.Fire	Hunger.Games
3	Mariya	4	NA	5	NA	2	5	2	NA	3
5	Irene	4	5	4	4	3	4	NA	5	3
7	Chris	4	4	NA	5	NA	NA	5	4	NA
11	Ronny	NA	5	5	NA	NA	NA	3	4	5
14	AndreyM	5	5	2	5	NA	5	5	4	3
15	Sally	5	5	5	5	NA	5	NA	5	3

#Predictions

recc_user_2 <- recc_predicted@items[[2]]
book_user_2 <- recc_predicted@itemLabels[recc_user_2]
book_user_2 %>% kable(caption = "User 2") %>% kable_styling("striped", full_width = FALSE)

User 2
x
Lord.of.the.Rings

recc_user_2 <- recc_predicted@items[[3]]
book_user_2 <- recc_predicted@itemLabels[recc_user_2]
book_user_2 %>% kable(caption = "User 3") %>% kable_styling("striped", full_width = FALSE)

User 3
x
Master.and.Margarita
Twilight..Series.
Hunger.Games

4. Conclusion

Both UBCF with IBCF results appear to be accurate - they both recommend books which users have not preiously read and books that have a high rating. UBCF is agreed to be more accurate for small dataset - so I will conclude that UBCF recommendation are better for my purposes. This project can be further expanded by adding users and increasing the scope beyond fantasy books. It would be interested to see how adding ‘tag’ type data can improve recommendations.

Elina Azrilyan - Data 612 - Project 2

1. Installing packages and loading our data:

2. Investigating the data

3. Predictions

4. Conclusion