Slope One
if("SlopeOne" %in% rownames(installed.packages()) == FALSE){
install_github(repo = "SlopeOne", username = "tarashnot")
}
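The ratings data used below ships with the SVDApproximation package by the same author; if that package is missing, it can presumably be installed the same way (this assumes the repository name tarashnot/SVDApproximation):
if ("SVDApproximation" %in% rownames(installed.packages()) == FALSE) {
  # assumed repository name; adjust if the package lives elsewhere
  devtools::install_github("tarashnot/SVDApproximation")
}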
library(SVDApproximation)
## Warning: replacing previous import 'data.table::melt' by 'reshape2::melt'
## when loading 'SVDApproximation'
## Warning: replacing previous import 'data.table::dcast' by 'reshape2::dcast'
## when loading 'SVDApproximation'
library(SlopeOne)
library(data.table)
data(ratings)
class(ratings)
## [1] "data.table" "data.frame"
dim(ratings)
## [1] 1000209 3
head(ratings)
## user item rating
## 1: 1 1 5
## 2: 6 1 4
## 3: 8 1 4
## 4: 9 1 5
## 5: 10 1 5
## 6: 18 1 4
summary(ratings)
## user item rating
## Min. : 1 Min. : 1 Min. :1.000
## 1st Qu.:1506 1st Qu.: 966 1st Qu.:3.000
## Median :3070 Median :1658 Median :4.000
## Mean :3025 Mean :1731 Mean :3.582
## 3rd Qu.:4476 3rd Qu.:2566 3rd Qu.:4.000
## Max. :6040 Max. :3706 Max. :5.000
# ratings is already a data.table, so rename its columns in place
setnames(ratings, c("user_id", "item_id", "rating"))
# work on a random 10% sample of the ratings to keep memory use and runtime modest
samp <- sample(nrow(ratings), 0.1 * nrow(ratings))
ratings <- ratings[samp, ]
ratings[, user_id := as.character(user_id)]
ratings[, item_id := as.character(item_id)]
setkey(ratings, user_id, item_id)
set.seed(1)
# randomly hold out 5 * (20% of the number of unique users) ratings as the test set
in_train <- rep(TRUE, nrow(ratings))
in_train[sample(1:nrow(ratings), size = round(0.2 * length(unique(ratings$user_id)), 0) * 5)] <- FALSE
ratings_train <- ratings[(in_train)]
ratings_test <- ratings[(!in_train)]
# normalize the training ratings; the returned object also keeps what is needed to undo the transformation
ratings_train_norm <- normalize_ratings(ratings_train)
# fit Slope One: average rating differences between pairs of items in the training data
model <- build_slopeone(ratings_train_norm$ratings)
# predict normalized ratings for the (user_id, item_id) pairs in the test set
predictions <- predict_slopeone(model,
                                ratings_test[, c(1, 2), with = FALSE],
                                ratings_train_norm$ratings)
# map the predictions back to the original rating scale
unnormalized_predictions <- unnormalize_ratings(normalized = ratings_train_norm,
                                                ratings = predictions)
rmse_slopeone <- sqrt(mean((unnormalized_predictions$predicted_rating - ratings_test$rating) ^ 2))
rmse_slopeone
## [1] 1.310448
Summary and Findings:
- Slope One was introduced by Daniel Lemire and Anna Maclachlan
- It is an item-based collaborative filtering method
- This implementation works with data.table objects
- Simple: instead of a full linear regression f(x) = ax + b, it uses predictors of the form f(x) = x + b, i.e. the slope is fixed at one (hence the name); see the short worked example after this list
- Fast, but for huge datasets it needs a lot of RAM, since it stores a deviation for every pair of co-rated items
- Accurate, often on par with more complicated and computationally expensive algorithms
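To make the f(x) = x + b point concrete, here is a minimal hand-rolled sketch in base R (independent of the SlopeOne package, using made-up toy data) that computes the average item-to-item deviation and uses it to predict a missing rating:
# Toy data: two items, three users; user C has rated item1 but not item2
toy <- data.frame(
  user   = c("A", "A", "B", "B", "C"),
  item   = c("item1", "item2", "item1", "item2", "item1"),
  rating = c(4, 5, 3, 4, 2)
)
# Average deviation of item2 over item1 across the users who rated both (A and B):
# ((5 - 4) + (4 - 3)) / 2 = 1
both <- merge(toy[toy$item == "item1", c("user", "rating")],
              toy[toy$item == "item2", c("user", "rating")],
              by = "user", suffixes = c("_item1", "_item2"))
dev_21 <- mean(both$rating_item2 - both$rating_item1)
# Slope One prediction for user C on item2: f(x) = x + b, with b = dev_21
pred_c_item2 <- toy$rating[toy$user == "C" & toy$item == "item1"] + dev_21
pred_c_item2
## [1] 3
With more than two items, Slope One averages such per-item predictions over all items the user has rated, typically weighting each deviation by the number of users who co-rated the pair (the weighted Slope One variant).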