The following demonstration is a partial film recommender system capable of recommending movies to users based on their ratings. Training and test datasets are extracted from the user ratings file (ratings.csv) available at grouplens.org/datasets/movielens/. The MovieLens ml-latest-small dataset was selected from the zip folder, and a user-item matrix was created.
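To make the user-item matrix concrete, here is a minimal sketch on made-up ratings. It uses base R's tapply() as a stand-in so that it runs without extra packages; the data frame toy and its values are assumptions for illustration, not part of the MovieLens data.

```r
# Made-up ratings in long format (one row per user-movie rating).
toy <- data.frame(
  userId  = c(1, 1, 2, 3, 3),
  movieId = c(10, 20, 10, 20, 30),
  rating  = c(4.0, 3.5, 5.0, 2.0, 4.5)
)

# Reshape to wide form: rows are users, columns are movies, NA marks "not rated".
# Each user-movie pair occurs once here, so mean() simply returns that rating.
toyMatrix <- tapply(toy$rating, list(toy$userId, toy$movieId), mean)
print(toyMatrix)
```

Most cells of such a matrix are NA on real data, because each user rates only a small share of the movies.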
# Data loading
ratings <- read.csv("https://raw.githubusercontent.com/RobertSellers/643_RECOMMENDER_SYSTEMS/master/databases/ml-latest-small/ratings.csv")
# Converting the ratings into 2 dimensional matrix
library(reshape2)
uiMatrix <- acast(ratings, userId~movieId, value.var="rating")
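# Coerce every entry to numeric. apply() over rows returns its results as columns, so the matrix comes back transposed.
uiMatrix <- apply(uiMatrix, 1, as.numeric)
Create test and train datasets: roughly 80% of the rows are sampled for training and the remaining 20% for testing.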
# Randomly sample the matrix as boolean reference
which_train <- sample(x = c(TRUE, FALSE), size = nrow(uiMatrix),replace = TRUE, prob = c(0.8, 0.2))
# Create the train/test datasets
train <- uiMatrix[which_train, ]
test <- uiMatrix[!which_train, ]
# Raw mean of the training dataset
train.avg <- mean(train, na.rm = TRUE)
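Because sample() with prob decides each row independently, the split above is only approximately 80/20 and changes on every run. A minimal sketch of a reproducible, exact-size alternative follows; the seed value and the toy matrix are made up for illustration.

```r
set.seed(643)  # arbitrary seed, chosen only for this illustration

# Stand-in for uiMatrix: 10 rows, 4 columns of made-up values.
toy <- matrix(seq_len(40), nrow = 10, ncol = 4)

# Pick exactly 80% of the row indices for training; the rest form the test set.
train_rows <- sample(nrow(toy), size = round(0.8 * nrow(toy)))
toy_train <- toy[train_rows, , drop = FALSE]
toy_test  <- toy[-train_rows, , drop = FALSE]

dim(toy_train)
dim(toy_test)
```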
# Root Mean Square Error between model values (m) and observed values (o); missing ratings are ignored
RMSE <- function(m, o) {
  sqrt(mean((m - o)^2, na.rm = TRUE))
}
# Training data Root Mean Square Error for the raw average
RMSE(train, train.avg)
## [1] 1.055908
# Test data Root Mean Square Error for the raw average
RMSE(test, train.avg)
## [1] 1.066138
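The RMSE calculation above can be checked by hand on a tiny example. The sketch below repeats the same definition so that it runs on its own; the vector observed is made up.

```r
# Same definition as above, repeated so the example is self-contained.
RMSE <- function(m, o) sqrt(mean((m - o)^2, na.rm = TRUE))

observed <- c(4, 3, NA, 5)                # NA marks a missing rating
raw_avg  <- mean(observed, na.rm = TRUE)  # (4 + 3 + 5) / 3 = 4

# Errors are 0, -1 and 1; the NA entry is dropped, so the result is sqrt(2 / 3).
RMSE(observed, raw_avg)
```

Since the errors are squared, the order of the two arguments does not affect the result.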
calcBias <- function(df, avg){
  # Row bias (userBias) and column bias (movieBias): row and column means minus the raw average, ignoring missing ratings
  userBias <- rowMeans(df, na.rm = TRUE) - avg
  movieBias <- colMeans(df, na.rm = TRUE) - avg
  # Baseline predictors: expand each bias vector to a full matrix
  # columnBias repeats movieBias in every row; rowBias repeats userBias in every column
  columnBias <- matrix(movieBias, nrow(df), ncol(df), byrow = TRUE)
  rowBias <- matrix(userBias, nrow(df), ncol(df))
  # Baseline calculation: movieBias matrix + userBias matrix + raw mean average
  baseline <- columnBias + rowBias + avg
  return(baseline)
}
baselineTrain <- calcBias(train, train.avg)
# Constraining the values between 0 and 5
baselineTrain[baselineTrain<0] <- 0
baselineTrain[baselineTrain>5] <- 5
# Test / Training RMSE for baseline predictors
# Training dataset RMSE
RMSE(baselineTrain, train)
## [1] 0.840045
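The baseline predictor used above is the raw average plus a row bias plus a column bias. A minimal sketch on a made-up 3 x 3 matrix shows the arithmetic; outer() is used here as a compact alternative to building the two bias matrices, and all names are illustrative.

```r
# Made-up 3 x 3 ratings matrix; NA marks a missing rating.
r <- matrix(c( 5,  3, NA,
               4, NA,  2,
              NA,  1,  3), nrow = 3, byrow = TRUE)

mu       <- mean(r, na.rm = TRUE)           # raw average
row_bias <- rowMeans(r, na.rm = TRUE) - mu  # one bias per row
col_bias <- colMeans(r, na.rm = TRUE) - mu  # one bias per column

# outer() builds the full grid of row_bias[i] + col_bias[j] without loops.
baseline <- mu + outer(row_bias, col_bias, "+")

# Keep predictions inside the rating range used above.
baseline <- pmin(pmax(baseline, 0), 5)
print(round(baseline, 2))
```

Here the raw average is 3, the row biases are 1, 0 and -1, and the column biases are 1.5, -1 and -0.5, so the top-left prediction of 5.5 is clipped to 5.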
# Raw mean of the test dataset
test.avg <- mean(test, na.rm =TRUE)
# Baseline predictors for the test dataset
baselineTest <- calcBias(test, test.avg)
# Test dataset RMSE
RMSE(baselineTest, test)
## [1] 0.8362919
All in all, the baseline predictors bring the RMSE down from about 1.06 for the raw average to about 0.84 on both the training and test datasets, so a typical prediction falls within one rating point of the observed value. It is my understanding that I should aspire to lower the RMSE further and develop a better predictive method against the test dataset. Efforts to construct equal-sized sparse (wide) data frames could be one solution, or matrix methods as yet unknown.
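One possible reading of the equal-sized idea is to hold out individual ratings rather than whole rows, so that the training and test matrices keep the same shape and the biases learned from training data apply directly to the held-out entries. The sketch below is an assumption about how that could look, not a tested improvement; the toy matrix, seed and variable names are made up.

```r
set.seed(643)  # arbitrary seed for a repeatable illustration

# Made-up dense 6 x 5 ratings matrix standing in for the user-item matrix.
r <- matrix(sample(1:5, 30, replace = TRUE), nrow = 6, ncol = 5)

# Hold out 20% of the entries; both matrices keep the shape of r.
held_out  <- sample(length(r), size = round(0.2 * length(r)))
toy_train <- r
toy_train[held_out] <- NA
toy_test  <- r
toy_test[-held_out] <- NA

# Biases are estimated from the training entries only.
mu       <- mean(toy_train, na.rm = TRUE)
row_bias <- rowMeans(toy_train, na.rm = TRUE) - mu
col_bias <- colMeans(toy_train, na.rm = TRUE) - mu
baseline <- pmin(pmax(mu + outer(row_bias, col_bias, "+"), 0), 5)

# RMSE on the held-out entries only.
heldout_rmse <- sqrt(mean((baseline - toy_test)^2, na.rm = TRUE))
heldout_rmse
```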