DATA 643 Project 1 - Global Baseline Predictors and RMSE

Data Loading and Preparation
Training and Testing
RMSE Calculations
Results
Discussion
Supportive links

The following demonstration is a partial film recommender system that predicts how users would rate movies, using global baseline predictors. Training and testing datasets are extracted from the user ratings file, ratings.csv, available at grouplens.org/datasets/movielens/. The MovieLens ml-latest-small dataset was selected from the zip folder, and a user-item matrix was created from it.

Data Loading and Preparation

# Data loading
ratings <- read.csv("https://raw.githubusercontent.com/RobertSellers/643_RECOMMENDER_SYSTEMS/master/databases/ml-latest-small/ratings.csv")

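# Converting the ratings into a 2 dimensional matrix (rows = users, columns = movies)
library(reshape2)
uiMatrix <- acast(ratings, userId~movieId, value.var="rating")
# acast() already returns a numeric matrix. apply(uiMatrix, 1, as.numeric) would
# return its transpose (movies as rows), so only the storage mode is enforced here.
storage.mode(uiMatrix) <- "numeric"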
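To make the shape of the user-item matrix concrete, here is a minimal sketch on a toy data frame. The values are made up, and base R's tapply() is swapped in for reshape2's acast() only so the sketch runs without extra packages; it is not the method used above.

```r
# Toy long-format ratings (hypothetical values), same columns as ratings.csv
ratings_toy <- data.frame(
  userId  = c(1, 1, 2, 3, 3),
  movieId = c(10, 20, 10, 20, 30),
  rating  = c(4.0, 3.5, 5.0, 2.0, 4.5)
)

# One row per user, one column per movie; unrated pairs become NA
ui_toy <- tapply(ratings_toy$rating,
                 list(userId = ratings_toy$userId,
                      movieId = ratings_toy$movieId),
                 FUN = mean)

dim(ui_toy)          # 3 users by 3 movies
is.numeric(ui_toy)   # the result is already numeric
is.na(ui_toy)        # TRUE marks the user/movie pairs with no rating
```

Most cells in the real matrix are NA in the same way, which is why every mean and RMSE in this report is computed with na.rm = TRUE.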

Training and Testing

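Create the training and test datasets. Each row of the matrix is assigned to training with probability 0.8 and to testing with probability 0.2, so the split is roughly, not exactly, 80/20.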

# Randomly flag each row of the matrix as training (TRUE) or test (FALSE)
which_train <- sample(x = c(TRUE, FALSE), size = nrow(uiMatrix), replace = TRUE, prob = c(0.8, 0.2))

# Create the train/test datasets
train <- uiMatrix[which_train, ]
test <- uiMatrix[!which_train, ]
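The row-wise split above can be sketched on a small stand-in matrix. The seed and the toy matrix are assumptions made for illustration; the original analysis does not set a seed, so its exact split and RMSE values will vary from run to run.

```r
set.seed(643)  # hypothetical seed, not part of the original analysis

# Toy matrix standing in for uiMatrix: 10 rows, 4 columns
toy <- matrix(seq_len(40), nrow = 10, ncol = 4)

# One TRUE/FALSE flag per row; TRUE is drawn with probability 0.8
which_train_toy <- sample(c(TRUE, FALSE), size = nrow(toy),
                          replace = TRUE, prob = c(0.8, 0.2))

# drop = FALSE keeps the result a matrix even if only one row is selected
train_toy <- toy[which_train_toy, , drop = FALSE]
test_toy  <- toy[!which_train_toy, , drop = FALSE]

# Every row lands in exactly one of the two sets
nrow(train_toy) + nrow(test_toy) == nrow(toy)
```

Because whole rows are moved, the two sets share the same columns but have no rows in common.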

RMSE Calculations

# Raw mean of the training dataset
train.avg <- mean(train, na.rm = TRUE)

# Root Mean Square Error, ignoring missing ratings
#   m : model (predicted) values
#   o : observed values
RMSE <- function(m, o){
  sqrt(mean((m - o)^2, na.rm = TRUE))
}

# Training data RMSE for the raw average
RMSE(train.avg, train)

## [1] 1.055908

# Test data RMSE for the raw average (the average still comes from the training data)
RMSE(train.avg, test)

## [1] 1.066138
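As a small worked example of the RMSE calculation, the sketch below applies the same formula to a few made-up values. The function is redefined under the hypothetical name rmse_toy so the block runs on its own.

```r
# Same formula as RMSE above, repeated so the sketch is self-contained
rmse_toy <- function(m, o) {
  sqrt(mean((m - o)^2, na.rm = TRUE))
}

predicted <- c(3, 4, NA, 2)   # hypothetical values; NA = nothing to compare
observed  <- c(2, 4, 5, 4)

# Squared errors are 1, 0, NA, 4; the NA is dropped, so the mean is 5/3
rmse_toy(predicted, observed)

# A single number is recycled against every observation, which is how the
# raw average is scored: squared errors 1, 1, 4, 1, so the mean is 7/4
rmse_toy(3, observed)
```

Since the difference is squared, swapping the two arguments does not change the result.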

calcBias <- function(df, avg){
  # Bias for each row (user) and each column (movie):
  # its mean rating minus the raw average, ignoring missing ratings
  userBias <- rowMeans(df, na.rm = TRUE) - avg
  movieBias <- colMeans(df, na.rm = TRUE) - avg

  # Expand each bias vector into a matrix with the same shape as df
  # rowBias[i, ] holds userBias[i]; columnBias[, j] holds movieBias[j]
  rowBias <- matrix(userBias, nrow(df), ncol(df))
  columnBias <- matrix(movieBias, nrow(df), ncol(df), byrow = TRUE)

  # Baseline calculation : movieBias matrix + userBias matrix + raw mean average
  baseline <- columnBias + rowBias + avg
  return(baseline)
}

baselineTrain <- calcBias(train, train.avg)

# Constraining the values between 0 and 5
baselineTrain[baselineTrain<0] <- 0 
baselineTrain[baselineTrain>5] <- 5
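The baseline predictor adds a user's bias and a movie's bias to the raw average. A minimal sketch on a 2 x 3 toy matrix shows the arithmetic; the values and the names ending in _toy are assumptions for illustration, and outer() is used here as a compact way to form every user/movie combination.

```r
# Toy user-item matrix (hypothetical values): 2 users, 3 movies
m_toy <- matrix(c(5, 3, NA,
                  4, NA, 2),
                nrow = 2, byrow = TRUE)

avg_toy   <- mean(m_toy, na.rm = TRUE)                # (5 + 3 + 4 + 2) / 4 = 3.5
user_toy  <- rowMeans(m_toy, na.rm = TRUE) - avg_toy  # 0.5, -0.5
movie_toy <- colMeans(m_toy, na.rm = TRUE) - avg_toy  # 1.0, -0.5, -1.5

# Entry [u, i] is avg + user bias of u + movie bias of i
baseline_toy <- avg_toy + outer(user_toy, movie_toy, "+")

# Constrain the predictions to the 0 to 5 range, as done above
baseline_toy[baseline_toy < 0] <- 0
baseline_toy[baseline_toy > 5] <- 5

baseline_toy
```

For example, user 1 and movie 3 have no observed rating, yet the baseline still gives 3.5 + 0.5 - 1.5 = 2.5 for that cell.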

Results

# Test / Training RMSE for baseline predictors

# Training dataset RMSE
RMSE(baselineTrain,train)

## [1] 0.840045

# Raw mean of the test dataset
test.avg <- mean(test, na.rm = TRUE)

# Baseline predictors for the test dataset, built from the test data's own mean and biases
baselineTest <- calcBias(test, test.avg)

# Test dataset RMSE
RMSE(baselineTest, test)

## [1] 0.8362919

Discussion

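All in all, the baseline predictors lower the RMSE from roughly 1.06 for the raw average to about 0.84 on both the training and test datasets, so predictions typically fall within one rating point of the observed value. The test baseline was built from the test data's own mean and biases, so the test RMSE is not a true out-of-sample error. It is my understanding that I should aspire to lower the RMSE further and develop a better predictive method against the test dataset. Constructing equal sized sparse (wide) training and test data frames could be one solution, as could matrix methods I have not yet explored.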
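One possible reading of the equal sized idea is to hold out individual ratings rather than whole rows, so that the training and test matrices keep the same shape and the biases can be fitted on training cells only. The sketch below is an assumption about how that could look, not something done in this project; the toy values, the seed, and all names are hypothetical.

```r
# Toy user-item matrix (hypothetical values); NA marks an unrated movie
ratings_toy <- matrix(c(5, 3, NA, 4,
                        4, NA, 2, 3,
                        NA, 2, 1, 2,
                        5, 4, 3, NA),
                      nrow = 4, byrow = TRUE)

set.seed(643)  # hypothetical seed, only to make the sketch repeatable

# Hold out roughly 20% of the observed cells
observed_idx <- which(!is.na(ratings_toy))
held_out <- sample(observed_idx, size = ceiling(0.2 * length(observed_idx)))

# Train and test keep the full shape; each hides the other's cells
train_toy <- ratings_toy
train_toy[held_out] <- NA
test_toy <- matrix(NA_real_, nrow(ratings_toy), ncol(ratings_toy))
test_toy[held_out] <- ratings_toy[held_out]

# Baseline fitted on the training cells only
mu_toy     <- mean(train_toy, na.rm = TRUE)
user_toy   <- rowMeans(train_toy, na.rm = TRUE) - mu_toy
movie_toy  <- colMeans(train_toy, na.rm = TRUE) - mu_toy
pred_toy   <- mu_toy + outer(user_toy, movie_toy, "+")

# RMSE over the held-out cells only
rmse_holdout <- sqrt(mean((pred_toy - test_toy)^2, na.rm = TRUE))
```

With a matrix this small a held-out cell can leave a row or column with very few ratings, so the number itself means little; the point is only that the test ratings never contribute to the biases used to predict them.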

Supportive links

https://stackoverflow.com/questions/26207850/create-sparse-matrix-from-a-data-frame

https://stackoverflow.com/questions/26237688/rmse-root-mean-square-deviation-calculation-in-r

https://github.com/jeknov/movieRec/blob/master/movieRec_descr.Rmd

https://www.youtube.com/watch?v=dGM4bNQcVKI