Introduction

The recommender system I would like to implement recommends opera productions. This is mostly out of my interest in attending operas, but I believe this area also presents a few interesting challenges.

Since there is no widely used system for tracking user ratings of operas, like IMDb or Netflix for movies and TV, I believe it makes sense to concentrate on professional critics' reviews. With a few baseline ratings provided by a user, it may be possible to match the user with a professional critic whose opinions can then drive future recommendations.
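
As a rough sketch of this idea (not part of the analysis below; the ratings matrix and the user vector are hypothetical), a new user's ratings could be correlated with each critic's ratings and the most similar critic used as a proxy for recommendations:

# Sketch: find the critic whose ratings correlate best with a new user's ratings.
# 'ratings' is a hypothetical critics-by-operas matrix; 'user_ratings' is a named
# vector of the user's ratings for a subset of the same operas.
match_critic <- function(ratings, user_ratings) {
  sims <- apply(ratings[, names(user_ratings), drop = FALSE], 1,
                function(critic) cor(critic, user_ratings,
                                     use = "pairwise.complete.obs"))
  names(which.max(sims))
}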

An opera recommender system also presents a few challenges of its own: there is no central platform collecting large numbers of user ratings, and most individual productions receive only a handful of professional reviews, so the available data is sparse.

# Required libraries
library(caTools)  # train/test split (sample.split)
library(dplyr)    # data manipulation
library(tidyr)    # reshaping to long form (gather)

Data Set

For this project, I wanted to create a small data set based on real reviews. Using reviews from Bachtrack (https://bachtrack.com), which assigns stars on a 1 to 5 scale, I recorded ratings for some Metropolitan Opera productions. I quickly realized that few productions have multiple reviews, so for this project I simply filled in random ratings for a few productions and critics. The full data set is below.

# Data import
data <- read.csv(paste0("https://raw.githubusercontent.com/ilyakats/CUNY-DATA643/",
                        "master/Project1_opera.csv"))
# Replace periods in column names with spaces (e.g. "Don.Giovanni" -> "Don Giovanni")
colnames(data) <- gsub("\\.", " ", colnames(data))
| Critic | Don Giovanni | Aida | Onegin | Turandot | Le Nozze di Figaro | Rigoletto | La Traviata |
|---|---|---|---|---|---|---|---|
| Oliver Brett | 4 | 4 | 5 | NA | NA | 4 | 5 |
| Ako Imamura | 3 | 3 | 5 | 3 | 4 | NA | 3 |
| Robert Levine | 5 | 4 | NA | NA | 4 | 3 | 5 |
| Stephen Raskauskas | 2 | 3 | 4 | 2 | 3 | 4 | 4 |
| Courtney Smith | 3 | NA | 3 | 3 | 4 | NA | 4 |
| Mark Pullinger | NA | 5 | NA | 5 | 4 | 3 | 5 |
| Ilya Kats | 5 | 4 | 5 | NA | NA | 3 | 4 |

Training/Testing Split

Most manipulations and calculations were done using the tidyverse. The data frame was converted to long form and split into training and testing sets using a 0.75 split ratio.

# Convert to long form
split_data <- data %>% gather(key = Opera, value = Rating, -Critic)

# Randomly split all ratings for training and testing sets
set.seed(50)
split <- sample.split(split_data$Rating, SplitRatio = 0.75)

# Prepare training set
train_data <- split_data
train_data$Rating[!split] <- NA

# Prepare testing set
test_data <- split_data
test_data$Rating[split] <- NA
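
As a quick sanity check (not part of the original write-up), the number of observed ratings in each set can be counted; with this seed there are 29 training ratings and 9 testing ratings, consistent with the tables below.

# Count observed (non-NA) ratings in each set
sum(!is.na(train_data$Rating))  # ratings kept for training
sum(!is.na(test_data$Rating))   # ratings held out for testing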

Training Set

| Critic | Aida | Don Giovanni | La Traviata | Le Nozze di Figaro | Onegin | Rigoletto | Turandot |
|---|---|---|---|---|---|---|---|
| Ako Imamura | NA | 3 | 3 | 4 | 5 | NA | 3 |
| Courtney Smith | NA | 3 | 4 | NA | 3 | NA | 3 |
| Ilya Kats | NA | 5 | 4 | NA | 5 | 3 | NA |
| Mark Pullinger | 5 | NA | NA | 4 | NA | NA | 5 |
| Oliver Brett | 4 | NA | NA | NA | 5 | 4 | NA |
| Robert Levine | 4 | 5 | 5 | NA | NA | NA | NA |
| Stephen Raskauskas | 3 | 2 | 4 | 3 | 4 | 4 | 2 |

Testing Set

| Critic | Aida | Don Giovanni | La Traviata | Le Nozze di Figaro | Onegin | Rigoletto | Turandot |
|---|---|---|---|---|---|---|---|
| Ako Imamura | 3 | NA | NA | NA | NA | NA | NA |
| Courtney Smith | NA | NA | NA | 4 | NA | NA | NA |
| Ilya Kats | 4 | NA | NA | NA | NA | NA | NA |
| Mark Pullinger | NA | NA | 5 | NA | NA | 3 | NA |
| Oliver Brett | NA | 4 | 5 | NA | NA | NA | NA |
| Robert Levine | NA | NA | NA | 4 | NA | 3 | NA |
| Stephen Raskauskas | NA | NA | NA | NA | NA | NA | NA |

Raw Average
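
The simplest predictor is the raw average: every missing rating is predicted as the mean of all observed training ratings. Accuracy is measured with RMSE; the standard definition (not spelled out in the original write-up) is

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{(c,o)} \left(r_{co} - \hat{r}_{co}\right)^2}$$

where r_co is the observed rating by critic c for opera o, r̂_co is the predicted rating, and N is the number of observed ratings.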

# Get raw average of all observed training ratings
raw_avg <- mean(train_data$Rating, na.rm = TRUE)

# Calculate RMSE for raw average on training and testing sets
rmse_raw_train <- sqrt(mean((train_data$Rating - raw_avg)^2, na.rm = TRUE))
rmse_raw_test <- sqrt(mean((test_data$Rating - raw_avg)^2, na.rm = TRUE))

Baseline Predictors
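
Baseline predictors adjust the raw average by a per-critic bias and a per-opera bias estimated from the training ratings (a standard formulation, consistent with the code below):

$$\hat{r}_{co} = \mu + b_c + b_o, \qquad b_c = \bar{r}_c - \mu, \qquad b_o = \bar{r}_o - \mu$$

where μ is the raw average, r̄_c is critic c's mean training rating, and r̄_o is the mean training rating of opera o.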

# Get critic and opera biases
critic_bias <- train_data %>% filter(!is.na(Rating)) %>% 
  group_by(Critic) %>%
  summarise(sum = sum(Rating), count = n()) %>% 
  mutate(bias = sum/count-raw_avg) %>%
  select(Critic, CriticBias = bias)
opera_bias <- train_data %>% filter(!is.na(Rating)) %>% 
  group_by(Opera) %>%
  summarise(sum = sum(Rating), count = n()) %>% 
  mutate(bias = sum/count-raw_avg) %>%
  select(Opera, OperaBias = bias)
train_data <- train_data %>% left_join(critic_bias, by = "Critic") %>%
  left_join(opera_bias, by = "Opera") %>%
  mutate(RawAvg = raw_avg) %>%
  mutate(Baseline = RawAvg + CriticBias + OperaBias)
test_data <- test_data %>% left_join(critic_bias, by = "Critic") %>%
  left_join(opera_bias, by = "Opera") %>%
  mutate(RawAvg = raw_avg) %>%
  mutate(Baseline = RawAvg + CriticBias + OperaBias)
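
Once the bias columns are joined in, the Baseline column doubles as a prediction for critic/opera pairs that were never rated. As an illustration (not part of the original write-up), the predicted rating of Turandot for Oliver Brett, which is missing from the data set, can be pulled directly:

# Predicted rating for a critic/opera pair with no observed rating
train_data %>%
  filter(Critic == "Oliver Brett", Opera == "Turandot") %>%
  pull(Baseline)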

# Calculate RMSE for baseline predictors on training and testing sets
rmse_base_train <- sqrt(mean((train_data$Rating - train_data$Baseline)^2, na.rm = TRUE))
rmse_base_test <- sqrt(mean((test_data$Rating - test_data$Baseline)^2, na.rm = TRUE))

RMSE and Summary

The table below shows RMSE values for the training and testing sets, for both the raw average and the baseline predictors; a short sketch of how such a table can be assembled follows it.

| Predictor | RMSE |
|---|---|
| Training: Raw Average | 0.9123280 |
| Training: Baseline Predictor | 0.6016829 |
| Testing: Raw Average | 0.7395728 |
| Testing: Baseline Predictor | 0.8612990 |
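
A minimal sketch of assembling the summary table from the computed values (the code that produced the table above is not shown in the original):

# Collect RMSE results into a summary data frame
rmse_summary <- data.frame(
  Predictor = c("Training: Raw Average", "Training: Baseline Predictor",
                "Testing: Raw Average", "Testing: Baseline Predictor"),
  RMSE = c(rmse_raw_train, rmse_base_train, rmse_raw_test, rmse_base_test)
)
rmse_summary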

Interestingly, RMSE improved with baseline predictors on the training set but worsened on the testing set. This is due to the small data set with very limited information; the testing set includes only 9 ratings. Because the data set was created by assigning ratings almost at random, calculating specific biases, such as a harsh critic or a generally well-received production, does not yield an improvement, since that structure is simply not in the data. I was able to adjust the data set and the training/testing split to produce values showing consistent improvement of the baseline predictors over the raw average; however, for illustration purposes I am reporting the initial results.