library(knitr)
load('FM/FM.RData')
source('Requirements.R')
source('SVD/packages.R')
read_chunk('FM/factorization_machines.R')
read_chunk('SVD/fsvd.R')
read_chunk('Evaluation/Evaluation.R')
read_chunk('content_based/libraries.R')
read_chunk('content_based/read_filter.R')
read_chunk('content_based/feature_generation.R')
read_chunk('content_based/collaborative_filter.R')
read_chunk('content_based/knn.R')
read_chunk('hybrid.R')
options(tibble.print_max = 5, tibble.print_min = 5)

Introduction and Objectives

Not many years ago, it was inconceivable that the same person would listen to the Beatles, Vivaldi, and Lady Gaga on their morning commute. But the glory days of radio DJs have passed, and musical gatekeepers have been replaced by personalizing algorithms and unlimited streaming services.

While the public is now listening to all kinds of music, algorithms still struggle in key areas. Without enough historical data, how would an algorithm know whether listeners will like a new song or a new artist? And how would it know which songs to recommend to brand-new users?

In this project, we aim to predict the chances that a user will listen to a song repetitively after the first observable listening event within a time window. If one or more recurring listening events are triggered within a month after the user’s very first observable listening event for a song, the target for that user-song pair is marked 1, and 0 otherwise.
Our dataset comes from a Kaggle competition in which KKBOX provides a training set consisting of information about the first observable listening event for each unique user-song pair within a specific time duration. Metadata for each unique user and song pair is also provided.

The train and test data are selected from users’ listening histories in a given time period. The train and test sets are split based on time, and the public/private split is based on unique user-song pairs.

Music Recommendation Rationale

Automatic music recommendation has become an increasingly relevant problem, since most music is now sold and consumed digitally. In recent years, the music industry has shifted more and more towards digital distribution through online music stores and streaming services such as iTunes, Spotify, Grooveshark, and Google Play.

Dataset and Exploration

library(tidyverse)
library(data.table)
library(Hmisc)
library(class)
library(recommenderlab)
library(dplyr)

The original dataset contains 7,377,418 observations, where each observation corresponds to the first observable listening event for each unique user-song pair. In order to train our models, we filter the data by choosing songs with at least 1000 occurrences and users with at least 500 occurrences. We then train on 75% of the filtered dataset and test on the other 25%. This results in a training set of 869,783 observations and a testing set of 289,928 observations.

options(tibble.print_max = 5, tibble.print_min = 5)
setwd("~/Desktop/drive/dsi/fall-17/personalization/Part II")
rm(list= ls())
train <- as.tibble(fread('data/train.csv'))
songs <- as.tibble(fread('data/songs.csv'))
song_extra_info <- as.tibble(fread('data/song_extra_info.csv'))

ratings <- as.data.table(train[c('msno', 'song_id', 'target')])

filter_song_data <- as.data.frame(table(ratings$song_id))
filter_user_data <- as.data.frame(table(ratings$msno))

names(filter_song_data) <- c('song_id', 'count')
names(filter_user_data) <- c('msno', 'count')
ratings$song_count <- filter_song_data$count[match(ratings$song_id, filter_song_data$song_id)]
ratings$user_count <- filter_user_data$count[match(ratings$msno, filter_user_data$msno)]

sc = 1000
uc = 500
filter_data <- subset(ratings, ratings$song_count>= sc & ratings$user_count >=uc)
nrow(filter_data)/nrow(ratings)
paste('Users:', length(unique(filter_data$msno)))
paste('Songs:', length(unique(filter_data$song_id)))

# Re-code user and song ids as consecutive integer codes
user_key <- as.data.frame(cbind(unique(filter_data$msno),c(1:length(unique(filter_data$msno)))))
names(user_key) <- c('user_code', 'msno') # user_code = original id, msno = integer code
song_key <- as.data.frame(cbind(unique(filter_data$song_id), c(1:length(unique(filter_data$song_id)))))
names(song_key) <- c('song_code', 'song_id') # song_code = original id, song_id = integer code
filter_data$msno <- user_key$msno[match(filter_data$msno,user_key$user_code)]
filter_data$song_id <- song_key$song_id[match(filter_data$song_id, song_key$song_code)]


set.seed(666)
train_prop = 0.75
train_test_split <- function(ratings, train_proportion = 0.8){
  sample_size <- floor(train_proportion*nrow(ratings))
  train_ind <- sample(seq_len(nrow(ratings)), size = sample_size)
  train_data <- ratings[train_ind,]
  test_data <- ratings[-train_ind,]
  split_data <- list('train_data' = train_data, 'test_data' = test_data)
  return(split_data)
}
split_data <- train_test_split(filter_data,0.75)
training_data <- split_data$train_data
testing_data <- split_data$test_data
songs$song_id <- song_key$song_id[match(songs$song_id, song_key$song_code)]
songs <- subset(songs, songs$song_id %in% filter_data$song_id)
song_extra_info$song_id <- song_key$song_id[match(song_extra_info$song_id, song_key$song_code)]
song_extra_info <- subset(song_extra_info, song_extra_info$song_id %in% filter_data$song_id)

training_data <- as.tibble(training_data)

Training Data

Let’s start by examining how each variable in the training_data affects the target. We base our exploratory data analysis on the following kernel: https://www.kaggle.com/adiamaan/eda-and-feature-engineering

Defining useful functions:

## ggplot setting for readable labels
readable_labs <- theme(axis.text=element_text(size=12),
                     axis.title=element_text(size=14),
                     plot.title = element_text(hjust = 0.5))

# Function to display the count of each category of the column and plot how it affects target
target_vs_column <-function(df, col_name, x , y, title){
    temp_df <- df %>% 
            group_by_(col_name) %>% 
            summarize(count = n(), mean_target = mean(target)) %>% 
            arrange(desc(mean_target)) 
    
    df_plot <- temp_df %>%  
              ggplot(aes_string(col_name, "mean_target")) + 
              geom_col(aes(fill=count)) +
              scale_fill_gradient(low='turquoise', high = 'violet')+
              coord_flip() +
              labs(x = x,
                   y = y,
                   title= title) +
              readable_labs
    print(df_plot)
    return (temp_df)
}

# Function to group songs or users by occurrence count and check it against mean_target
target_vs_colcount <- function(df, col_name, x, y, title){
    df %>% 
      group_by_(col_name) %>% 
      summarize(count = n(), mean_target = mean(target)) %>% 
      group_by(count) %>% 
      summarize(new_count = n(), avg_target = mean(mean_target)) %>% 
      rename(no_of_items = new_count, occurence = count) %>% 
      arrange(desc(avg_target)) %>% 
      ggplot(aes(occurence, avg_target)) +
        geom_line(color='turquoise') +
        geom_smooth(color='turquoise') +
        labs(x = x,
             y = y,
             title= title) +
        readable_labs
}

Song count and User count vs target

Song count vs Target

Songs are grouped by how often they occur, and the occurrence count is checked against the target variable. The plot below shows the relationship between a song's occurrence count and the mean target value: the more often a song occurs, the more discoverable it is and the more likely users are to listen to it repeatedly.

target_vs_colcount(training_data, "song_id", "Song Occurence", "Target", "Song Occurence vs Target")

User count vs Target

There is also a trend between how frequent a listener the user is and the likelihood that they will listen to a song repetitively.

target_vs_colcount(training_data, "msno", "User Occurence", "Target", "User Occurence vs Target")

Target is unbalanced

Finally, we note that the target variable in our filtered dataset is not balanced. About 63% of the observations in the training data have a target value of one.

training_data %>% 
  group_by(target) %>% 
  count

Members

Let’s move on to the members metadata.

In the members dataframe, city, bd, gender, and registered_via are categorical variables, while registration_init_time and expiration_date are dates. We choose not to use registered_via or the date features in our models, so we will not explore these variables here.

Again we define some useful functions for exploration to help us determine how each of these variables affects the target:

members_colgroup <- function(df,col_name, x, y, title, xmin, xmax, ymin, ymax){
                      
    temp_df <- df %>% 
                  group_by_(col_name) %>% 
                  count() %>% 
                  arrange(desc(n))
    
    df_plot <- temp_df %>% 
                    ggplot(aes_string(col_name, "n")) + 
                    geom_col(fill='goldenrod2') + 
                    labs(x = x,
                         y = y,
                         title = title) +
                    xlim(xmin, xmax) +
                    ylim(ymin, ymax) +
                    readable_labs
    
    print(df_plot)
    return(temp_df)

}

Distribution of city, bd(age), gender

Age

There seem to be some outliers in the age field, so we will have to clean these before we build our models.
The sorted bd (age) counts are shown in the tibble as well as the graph. There are 1578 records with an age of 0; these could be either outliers or missing values. We plot the age range 1-100 to show the real distribution.

members_colgroup(members, "bd", "Age", "Frequency", "Age Distribution", 1, 100, 0, 200)
## Warning: Removed 5 rows containing missing values (position_stack).
City
members_colgroup(members, "city", "City", "Frequency", "City Distribution", 0, 25, 0, 1500)
Gender

The numbers of male and female users are almost equal. We also have a lot of missing values.

members %>% 
  group_by(gender) %>% 
  count

Songs

Finally, let’s examine the available song metadata. These features will be useful for both our content-based recommender and factorization machine model.

songs

Top Items

Let’s look at the most frequent items in each category.

top_100 <- function(df, col_name)
{
  temp_df <- df %>% 
    group_by_(col_name) %>% 
    count %>% 
    arrange(desc(n)) %>% 
    print
  
  return(temp_df)
}
Top 100 Artists
artist_count <- top_100(songs, "artist_name")
## # A tibble: 363 x 2
## # Groups:   artist_name [363]
##         artist_name     n
##               <chr> <int>
## 1 周杰倫 (Jay Chou)    54
## 2   Various Artists    47
## 3   五月天 (Mayday)    40
## 4     田馥甄 (Hebe)    26
## 5   林俊傑 (JJ Lin)    23
## # ... with 358 more rows
Top 100 Lyricist
lyricist_count <- top_100(songs, "lyricist")
## # A tibble: 458 x 2
## # Groups:   lyricist [458]
##   lyricist     n
##      <chr> <int>
## 1            303
## 2     阿信    36
## 3   方文山    29
## 4   姚若龍    18
## 5   周杰倫    17
## # ... with 453 more rows
Top 100 composer
composer_count <- top_100(songs, "composer")
## # A tibble: 629 x 2
## # Groups:   composer [629]
##   composer     n
##      <chr> <int>
## 1            109
## 2   周杰倫    55
## 3     阿信    20
## 4   林俊傑    17
## 5   陳皓宇    16
## # ... with 624 more rows
Top 100 Language
language_count <- top_100(songs, "language")
## # A tibble: 7 x 2
## # Groups:   language [7]
##   language     n
##      <dbl> <int>
## 1        3   719
## 2       52   199
## 3       31   122
## 4       10    13
## 5       17     9
## # ... with 2 more rows

21 songs have the same artist and lyricist name.
402 songs have the same lyricist and composer name.
27 songs have the same artist and composer name.

Top Genre’s

Genre id is a multi-label column, with each song carrying between 1 and 8 genre labels.
There are 192 unique genres. There are some missing values as well.

genre_count <- songs %>% 
                  separate(genre_ids, c("one", "two", "three", "four", "five", "six", 
                                        "seven", "eight"), extra="merge") %>% 
                  select(one:eight)%>% 
                  gather(one:eight, key="nth_id", value="genre_ids", na.rm=TRUE) %>% 
                  group_by(genre_ids) %>% 
                  count %>% 
                  arrange(desc(n)) %>% 
                  print()
## # A tibble: 19 x 2
## # Groups:   genre_ids [19]
##   genre_ids     n
##       <chr> <int>
## 1       465   531
## 2       458   289
## 3       921    69
## 4       444    56
## 5      1609    52
## # ... with 14 more rows

Distribution of song length

songs %>% 
  mutate(song_length = song_length/6e4) %>% 
  ggplot(aes(song_length)) +
  geom_histogram(binwidth = 0.25, fill='darkorchid3') +
  labs(x='Song Length', y = 'Frequency', title = 'Distribution of song length') +
  xlim(0, 10)

Collaborative Filter

We build item-based and user-based collaborative filtering models using the recommenderlab package to serve as baselines for the later models. Important parameters for the models are:
  • Similarity measure: Jaccard, as the ratings are binary
  • Number of neighbors: 30
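
For illustration, the Jaccard similarity used here can be computed for a pair of binary rating vectors as in the minimal sketch below (hypothetical vectors; recommenderlab computes this internally):

# Minimal sketch: Jaccard similarity between two binary rating vectors
jaccard_sim <- function(a, b) {
  intersection <- sum(a == 1 & b == 1)
  union <- sum(a == 1 | b == 1)
  if (union == 0) return(0)
  intersection / union
}
# hypothetical item profiles: which users repeated each song (1) or not (0)
item_x <- c(1, 0, 1, 1, 0)
item_y <- c(1, 1, 1, 0, 0)
jaccard_sim(item_x, item_y)  # 2 shared positive entries out of 4 in the union -> 0.5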

Item Based CF

user_item_rating <- dcast(training_data, formula = msno~song_id, value.var = 'target')
training_sparse <- as(as.matrix(user_item_rating[-1]), 'realRatingMatrix')

ibcf <- Recommender(data = training_sparse, method = "IBCF", parameter = list(method = "Jaccard"))
predicted_ratings <- predict(ibcf, training_sparse, type = 'ratings')
predicted_ratings <- as(predicted_ratings, 'matrix')
predicted_ratings <- as.data.frame(predicted_ratings)
predicted_ratings$msno <- user_item_rating$msno
predicted_ratings_melt <- melt(predicted_ratings, id.vars = 'msno', variable.name = 'song_id')
results_ibcf <- merge(testing_data, predicted_ratings_melt, by = c('msno', 'song_id'))
results_ibcf$predicted_class <- 1
results_ibcf$predicted_class[results_ibcf$value < 0.5] <- 0

User Based CF

ubcf <- Recommender(data = training_sparse, method = "UBCF", parameter = list(method = "Jaccard"))
predicted_ratings <- predict(ubcf, training_sparse, type = 'ratings')
predicted_ratings <- as(predicted_ratings, 'matrix')
predicted_ratings <- as.data.frame(predicted_ratings)
predicted_ratings$msno <- user_item_rating$msno
predicted_ratings_melt <- melt(predicted_ratings, id.vars = 'msno', variable.name = 'song_id')
results_ubcf <- merge(testing_data, predicted_ratings_melt, by = c('msno', 'song_id'))
results_ubcf$predicted_class <- 1
results_ubcf$predicted_class[results_ubcf$value < 0.5] <- 0
results_ubcf$random_class <- 1

Baseline Accuracy Measures

conf_matrix_cf_ibcf <- table(results_ibcf$target, results_ibcf$predicted_class)
ibcf_accuracy <- (conf_matrix_cf_ibcf[1,1]+conf_matrix_cf_ibcf[2,2])/sum(conf_matrix_cf_ibcf)
conf_matrix_cf_ubcf <- table(results_ubcf$target, results_ubcf$predicted_class)
ubcf_accuracy <- (conf_matrix_cf_ubcf[1,1]+conf_matrix_cf_ubcf[2,2])/sum(conf_matrix_cf_ubcf)
conf_matrix_rand <- table(results_ubcf$target, results_ubcf$random_class)
rand_accuracy <- (conf_matrix_rand[2,1])/sum(conf_matrix_rand)

cf_accuracy <- as.data.frame(cbind(c('item_based', 'user_based', 'random'), c(ibcf_accuracy, ubcf_accuracy, rand_accuracy)))
names(cf_accuracy) <- c('model', 'accuracy')
cf_accuracy$accuracy_r <- round(as.numeric(as.character(cf_accuracy$accuracy)), 4)
paste('Random Prediction Accuracy is: ', round(rand_accuracy,4))
## [1] "Random Prediction Accuracy is:  0.6328"
paste('Item-Based Collaborative Filter Accuracy is: ', round(ibcf_accuracy,4))
## [1] "Item-Based Collaborative Filter Accuracy is:  0.7073"
print('IBCF Confusion Matrix')
## [1] "IBCF Confusion Matrix"
conf_matrix_cf_ibcf
##    
##          0      1
##   0  40460  65997
##   1  18875 164592
paste('User-Based Collaborative Filter Accuracy is: ', round(ubcf_accuracy,4))
## [1] "User-Based Collaborative Filter Accuracy is:  0.6743"
print('UBCF Confusion Matrix')
## [1] "UBCF Confusion Matrix"
conf_matrix_cf_ubcf
##    
##          0      1
##   0  30538  75919
##   1  18514 164953
p <- ggplot(data=cf_accuracy, aes(x=model, y=accuracy_r)) +
  geom_bar(stat="identity", fill="steelblue")+ ylim(0,0.8) +
  ggtitle('Collaborative Filtering Models') +
  theme_minimal()
p

  • We can see that the item-based and user-based collaborative filters perform better than simply recommending every song to the user.

Factorization Machines

Model Summary:

We first explore factorization machines. At their core, factorization machines allow us to model feature-rich datasets by including higher-order interactions weighted by the inner product of latent vectors. As a result, we can estimate reliable parameters even in highly sparse data! Factorization machines consist of the following parameters that must be learned (where n is the number of features in the dataset and k is the dimensionality of the latent features):

  • \(w_0\) – the global bias term
  • \(\mathbf{w}\) – a vector of weights for each of the n input variables
  • \(V\) – an \(n \times k\) matrix, where the inner product of the \(i^{th}\) and \(j^{th}\) row of \(V\) is the weight of the interaction between the \(i^{th}\) and \(j^{th}\) input variable.

The model is written as: \[\hat{y}(\mathbf{x}) := w_0 + \sum_{j=1}^n w_j x_j + \sum_{j=1}^n \sum_{j'=j+1}^n x_j x_{j'} \sum_{f=1}^k v_{j,f} v_{j',f}\]
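
To make the equation concrete, the following minimal sketch computes the second-order FM prediction for a single feature vector with hypothetical parameters (illustrative values only, not the estimates produced by libFM):

# Direct evaluation of the FM prediction equation for one feature vector x
fm_predict <- function(x, w0, w, V) {
  n <- length(x)
  pairwise <- 0
  for (j in 1:(n - 1)) {
    for (jp in (j + 1):n) {
      pairwise <- pairwise + x[j] * x[jp] * sum(V[j, ] * V[jp, ])  # interaction weighted by <v_j, v_j'>
    }
  }
  w0 + sum(w * x) + pairwise
}
set.seed(1)
n <- 6; k <- 3
x  <- c(1, 0, 0, 1, 0, 1)                 # toy one-hot-style feature vector
w0 <- 0.1; w <- rnorm(n, 0, 0.1)          # hypothetical global bias and linear weights
V  <- matrix(rnorm(n * k, 0, 0.1), n, k)  # hypothetical n x k latent factor matrix
fm_predict(x, w0, w, V)

(libFM evaluates the pairwise term in O(kn) via an algebraic reformulation; the double loop above is only for clarity.)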

The image above (from Rendle [2010]) shows the format the data must be in to train a factorization machine model. Each feature vector \(\mathbf{x}\) is used to predict each target value \(y\). In our dataset, we one-hot encode songs (song_id) and members (msno); a small illustrative sketch of this encoding appears after the feature lists below. Additionally, we include the following features from the metadata to improve the accuracy of predictions:

Song metadata:

  • Language
  • Artist name
  • Genre (first listed if multiple)
  • Number of genres
  • Song length

Member metadata:

  • City
  • Gender
  • Birthday (age)

Additional features:

  • Source system tab
  • Source screen name
  • Source type
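
To make this input format concrete, the sketch below expands a hypothetical toy data frame into indicator columns with model.matrix. Note that model.matrix drops one reference level per factor (treatment contrasts), whereas libFM keeps an indicator for every level, so this only approximates the actual layout:

# Toy illustration of the wide, sparse design matrix; all values are hypothetical
toy <- data.frame(msno = factor(c("u1", "u2", "u1")),
                  song_id = factor(c("s1", "s1", "s2")),
                  log_song_length = c(2.3, 2.3, 2.5),
                  target = c(1, 0, 1))
model.matrix(target ~ msno + song_id + log_song_length, data = toy)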

We begin by merging the training dataset with the song and member metadata and doing some data cleaning and feature engineering:

all <- data.table(rbind(training_data, testing_data))

# clean song genre
songs %<>% 
  separate(genre_ids, c("one", "two", "three", "four", "five", "six", "seven", "eight"), 
           extra="merge") %>%
  replace_na(list(one = 0, two = 0, three = 0, four = 0,
                  five = 0, six = 0, seven = 0, eight = 0)) %>% 
  mutate(no_of_genre = (if_else(one == 0, 0, 1) + if_else(two == 0, 0, 1) +
                          if_else(three == 0, 0, 1) + if_else(four == 0, 0, 1) +
                          if_else(five == 0, 0, 1) + if_else(six == 0, 0, 1) +
                          if_else(seven == 0, 0, 1) + if_else(eight == 0, 0, 1))) %>% 
  select(song_id, song_length, language, artist_name, no_of_genre, one)

#add metadata
merged_all <- all %>% 
  left_join(songs, by = "song_id") %>%
  left_join(members, by = "msno") %>%
  left_join(train, by = c("msno", "song_id", "target"))

#drop variables from the merged data
merged_all %<>%
  select(-song_count, -user_count, -registered_via, -registration_init_time, -expiration_date)

#clean data: categorical variables must be coded as factors to train the model
merged_all <- within(merged_all, {
  bd[bd < 14 | bd > 100] <- median(bd) #set age outliers to median
  logbd <- log(bd)
  log_song_length <- log10(song_length)
  one <- factor(one)
  msno <- factor(msno)
  song_id <- factor(song_id)
  artist_name <- factor(artist_name)
  language <- factor(language)
  city <- factor(city)
  gender <- factor(gender)
  source_system_tab <- factor(source_system_tab)
  source_screen_name <- factor(source_screen_name)
  source_type <- factor(source_type)
}) 

merged_training <- merged_all[1:nrow(training_data),]
merged_testing <- merged_all[-c(1:nrow(training_data)),]

The dataset is large, so for the sake of efficiency, we use the R library libFMexe which is an interface to the C++ libFM library by Steffen Rendle. This allows us to run many models and tune k, the hyperparameter for the number of latent dimensions.

While we explored factorization machines with several different learning methods (e.g. ALS and SGD), the model we present here is a second-order factorization machine trained with Markov chain Monte Carlo. Stochastic gradient descent and alternating least squares require additional parameter tuning to determine the optimal learning rate (SGD) and regularization terms (SGD and ALS), which is simply not feasible given the size of the dataset and our compute resources. Furthermore, in preliminary models MCMC appeared to outperform both SGD and ALS, so MCMC was the clear choice of learning method to maximize prediction accuracy.

Tuning Hyperparameters

Choosing k:

We use 3-fold cross validation with 100 iterations of MCMC to determine the optimal number of latent dimensions k. For each fold, the model outputs a vector of probabilities equal to the size of the held out data. This vector is converted to a vector of binary values based on a probability threshold of 0.5. The accuracy is then taken to be the number of target values that the model correctly predicted divided by the total number of target values in the held out data. The error rate is (1 – accuracy).
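
Concretely, the conversion from held-out probabilities to an error rate works as in the short sketch below (hypothetical vectors, not the cv_libFM internals):

probs  <- c(0.81, 0.44, 0.62, 0.09)   # hypothetical predicted probabilities for a held-out fold
actual <- c(1, 0, 0, 0)               # true target values
pred   <- ifelse(probs >= 0.5, 1, 0)  # threshold at 0.5
accuracy   <- mean(pred == actual)    # 3 of 4 correct -> 0.75
error_rate <- 1 - accuracy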

k.vec <- seq(0, 30, by = 5)
cv_k <- cv_libFM(merged_training, target ~ song_id + language + artist_name +
                   one + no_of_genre + msno + city + gender + logbd + log_song_length +  
                   source_system_tab + source_screen_name + source_type,
                 task = "c", dims = k.vec, iter = 100, folds = 3,
                 cv_verbosity = 1, verbosity = 1, method = "mcmc")

We analyze the results of the cross-validation parameter tuning by plotting the mean error rate across the folds as a function of k. We choose the k with the lowest average error rate across the three folds, or equivalently the highest average accuracy; the error rate is used only to compare models. The following plot displays the error rate as a function of k, the number of latent dimensions. We also display a table of the mean error rate across the three folds (with standard error) for each k.

plot_param_cv <- function(k, m, err.bar.width = .1){
  # Plot Parameter Tuning CV Results
  ## input: 
  # k: vector of values to test for the hyperparamter
  # m: an n x k matrix of accuracy rates (n = # of folds)
  ## returns list of:
  # data frame of parameter values, mean error rate, and standard error
  # ggplot of the error rate vs. value of the hyperparameter
  
  err <- 1 - m
  mean <- apply(err, 2, mean)
  se <- apply(err, 2, function(x) sd(x)/sqrt(length(x)))
  df <- data.frame(cbind(k, mean, se))
  
  p <- ggplot(df, aes(x=k, y=mean)) + 
    geom_errorbar(aes(ymin=mean-se, ymax=mean+se), 
                  width = err.bar.width, color='blue' ) +
    geom_line(color='blue') +
    ylab("Error Rate") +
    theme(plot.title = element_text(hjust = 0.5))
  
  return(list(df, p))
}
#use the fm_k_cv data frame in the FM.RData file which contains the console output of cv_libFM
cv_k_results <- plot_param_cv(k.vec, fm_k_cv, err.bar.width = .3)
as.tibble(cv_k_results[[1]])
cv_k_results[[2]] + xlab('Number of Latent Dimensions (k)') +
  ggtitle('Tuning Hyperparameter for Latent Dimensionality') 

k <- k.vec[which.min(cv_k_results[[1]]$mean)]
k
## [1] 25

Choosing initialization standard deviation

Next, we use cross-validation to determine the optimal initialization standard deviation. This parameter is the standard deviation of the normal distribution used to initialize the parameters in \(V\). It is important to choose a good value for this parameter so that the error rate converges quickly. Again, we use 3-fold cross validation with 100 iterations of MCMC to tune this parameter.

sd.vec <- c(.01, .05, .1, .15, .2)
cv_sd <- cv_libFM(merged_training, target ~ song_id + language + artist_name +
                   one + no_of_genre + msno + city + gender + logbd + log_song_length +  
                   source_system_tab + source_screen_name + source_type,
                 task = "c", dim = k, init_stdevs = sd.vec, iter = 100, folds = 3,
                 cv_verbosity = 1, verbosity = 1, method = "mcmc")
#use the fm_sd_cv data frame in the FM.RData file which contains the console output of cv_libFM
cv_sd_results <- plot_param_cv(sd.vec, fm_sd_cv, err.bar.width = .003)
cv_sd_results[[1]]
cv_sd_results[[2]] + xlab("Initialized Standard Deviation") + 
  ggtitle('Tuning Initialized Standard Deviation') 

The optimal standard deviation is clearly .1.

FM Model

From the parameter tuning above, we determine that k = 25 latent dimensions gives us the highest accuracy. Additionally, an initialized standard deviation of 0.1 leads to the quickest error rate convergence. Thus, we set the hyperparameters to these values for our final model. We train our final factorization machine model using 500 iterations of MCMC to ensure convergence of the error rate. We then use this model to generate predictions for the testing dataset.

#create an FM model with chosen  hyperparameters (k = 25 and init_stdev = .1)
predFM.mcmc <- libFM(merged_training, merged_testing, target ~ song_id + language + artist_name +
                        one + no_of_genre + msno + city + gender + logbd + log_song_length +  
                        source_system_tab + source_screen_name + source_type,
                      task = "c", dim = k, init_stdev = .1, iter = 500,
                      verbosity = 1, method = "mcmc")

#predFM.mcmc.accuracy contains the console output of libFM 
#(that is, the training and testing accuracy for each iteration of MCMC)
predFM.mcmc.err <- 1 - predFM.mcmc.accuracy
predFM.mcmc.err <- cbind(Iteration = 1:500, predFM.mcmc.err)
names(predFM.mcmc.err)[2:3] <- c('Train', 'Test')

We look at the training and testing error as a function of the number of iterations of MCMC to ensure the error rate converges in 500 iterations.

err.conv <- predFM.mcmc.err %>% melt(id = 1) %>%
  ggplot(aes(x = Iteration, y = value, color = variable)) +
  geom_line() + ylab('Error Rate') + 
  ggtitle('Convergence of Training and Testing Error') +
  theme(legend.title=element_blank(), plot.title = element_text(hjust = 0.5)) 
err.conv

The decrease in error rate is very small after approximately 300 iterations, so we conclude that 500 iterations are sufficient to train the model.

Evaluation

Finally, we evaluate the model according to the following metrics:

  • accuracy
  • error
  • precision
  • recall
  • F1
  • auc

Additionally, we display the confusion matrix and an ROC curve for the predictions.

We start by writing a generic evaluation function, which we will use to compare the performance of different models:

evaluate <- function(scores, target){
  ## Evaluate Predictions
  ## input 
  # scores: vector of scores 
  # (i.e. probability that an observation should have target value 1)
  # target: vector of actual target values
  ## returns 
  # dataframe of evaluation metrics
  # confusion matrix for predicted and actual target values
  # roc object for plotting
  
  pred.target <- ifelse(scores >= 0.5, 1, 0)
  accuracy <- sum(pred.target == target)/length(target)
  error <- 1 - accuracy
  
  conf.mat <- table(pred.target, target, dnn = c("predicted","actual"))
  precision <- conf.mat[2,2]/sum(conf.mat[2,])
  recall <- conf.mat[2,2]/sum(conf.mat[,2])
  F1 <- 2 * (precision*recall/(precision + recall))
  auc <- auc(target, scores)
  eval.metrics <- data.frame(cbind(accuracy, error, precision, recall, F1, auc))
  
  roc <- roc(target, scores, direction="<")
  
  return(list(eval.metrics, conf.mat, roc))
}

We then call this function on the output probabilities of the FM model:

FM.eval <- evaluate(scores = predFM.mcmc, target = testing_data$target)
FM.roc.plot <- ggroc(FM.eval[[3]], alpha = 0.5, colour = "blue", linetype = 2, size = 2) +
  theme(plot.title = element_text(hjust = 0.5)) 

Finally, we display the evaluation metrics and the ROC plot.

FM.eval[[1]] #print evaluation metrics
FM.eval[[2]] #print confusion matrix
##          actual
## predicted      0      1
##         0  62209  28732
##         1  44250 154737
FM.roc.plot + ggtitle('ROC Curve for Factorization Machine Model') +
  theme(plot.title = element_text(hjust = 0.5))

We find that the accuracy of approximately 0.7482 is well above that of user-based collaborative filtering (0.67), item-based collaborative filtering (0.71), and the random baseline model (0.63). Additionally, the recall for this model is fairly high: out of all of the observations in the testing dataset with target value 1, our model was able to retrieve 84.3% of them. This also indicates that the coverage of our model is quite good, since we are not only predicting a target value of 1 for the most popular songs and most active users. Still, our prediction accuracy might be affected by the popularity of a given song. We explore this relationship in the following plot:

FM.test.data <- cbind(merged_testing, predFM.mcmc)

FM.test.data %>% 
  group_by(song_id) %>% 
  summarize(count = n(), accuracy = sum((predFM.mcmc >= 0.5) == target)/n()) %>%
  group_by(count) %>% 
  summarize(new_count = n(), avg_acc = mean(accuracy)) %>% 
  rename(no_of_items = new_count, occurence = count) %>% 
  arrange(desc(avg_acc)) %>% 
  ggplot(aes(occurence, avg_acc)) +
  geom_line(color='turquoise') +
  geom_smooth(color='turquoise') +
  labs(x = "Song Occurence",
       y = "Accuracy",
       title= "Song Occurence vs FM Predictive Accuracy") 
## `geom_smooth()` using method = 'loess'

There is a clear upward trend indicating that the FM model performs better on more popular songs.

Content Based Recommender

In this section, we build the content-based recommender from scratch, without using any ready-made package.

We take the following steps to build our content-based recommender using the song features.

Preprocessing and Feature Extraction

  • In the feature extraction phase, the descriptions of the various items are extracted. Although it is possible to use any kind of representation, such as a multidimensional data representation, the most common approach is a simple binary feature encoding.
  • We generate one hot encoded vectors for the items to be represented in the feature space.
genres <- as.data.frame(songs$genre_ids, stringsAsFactors=FALSE)
genres <- as.data.frame(tstrsplit(genres[,1], '[|]', type.convert=TRUE), stringsAsFactors=FALSE)
colnames(genres) <- c(1:3)
genres$song_id <- songs$song_id

genre_data <- genres[c(4,1)]
for(i in 2:3){
  tester <- genres[c(4,i)]
  tester <- na.omit(tester)
  names(tester) <- names(genre_data)
  genre_data <- rbind(genre_data, tester)
}
names(genre_data) <- c('song_id', 'genre')
genre_binary <- dcast(genre_data, formula = song_id~genre, fun.aggregate = NULL)
genre_binary[-1][!is.na(genre_binary[-1])] <- 1
genre_binary[is.na(genre_binary)] <- 0

lang_data <- songs[c('song_id', 'language')]
lang_data <- na.omit(lang_data)
lang_binary <- dcast(lang_data, formula = song_id~language, fun.aggregate = NULL)
lang_binary[-1][!is.na(lang_binary[-1])] <- 1
lang_binary[is.na(lang_binary)] <- 0

artists <- as.data.frame(songs$artist_name, stringsAsFactors=FALSE)
artists <- as.data.frame(tstrsplit(artists[,1], '[|]', type.convert=TRUE), stringsAsFactors=FALSE)
colnames(artists) <- c(1:3)
artists$song_id <- songs$song_id

artists_data <- artists[c(4,1)]
for(i in 2:3){
  tester <- artists[c(4,i)]
  tester <- na.omit(tester)
  names(tester) <- names(artists_data)
  artists_data <- rbind(artists_data, tester)
}
names(artists_data) <- c('song_id', 'artist')
artists_binary <- dcast(artists_data, formula = song_id~artist, fun.aggregate = NULL)
artists_binary[-1][!is.na(artists_binary[-1])] <- 1
artists_binary[is.na(artists_binary)] <- 0

composer <- as.data.frame(songs$composer, stringsAsFactors=FALSE)
composer <- as.data.frame(tstrsplit(composer[,1], '[|]', type.convert=TRUE), stringsAsFactors=FALSE)
colnames(composer) <- c(1:11)
composer$song_id <- songs$song_id

composer_data <- composer[c(12,1)]
for(i in 2:11){
  tester <- composer[c(12,i)]
  tester <- na.omit(tester)
  names(tester) <- names(composer_data)
  composer_data <- rbind(composer_data, tester)
}
names(composer_data) <- c('song_id', 'composers')
composer_binary <- dcast(composer_data, formula = song_id~composers, fun.aggregate = NULL)
composer_binary[-1][!is.na(composer_binary[-1])] <- 1
composer_binary[is.na(composer_binary)] <- 0

lyricist <- as.data.frame(songs$lyricist, stringsAsFactors=FALSE)
lyricist <- as.data.frame(tstrsplit(lyricist[,1], '[|]', type.convert=TRUE), stringsAsFactors=FALSE)
colnames(lyricist) <- c(1:11)
lyricist$song_id <- songs$song_id

lyricist_data <- lyricist[c(12,1)]
for(i in 2:11){
  tester <- lyricist[c(12,i)]
  tester <- na.omit(tester)
  names(tester) <- names(lyricist_data)
  lyricist_data <- rbind(lyricist_data, tester)
}
names(lyricist_data) <- c('song_id', 'lyricists')
lyricist_binary <- dcast(lyricist_data, formula = song_id~lyricists, fun.aggregate = NULL)
lyricist_binary[-1][!is.na(lyricist_binary[-1])] <- 1
lyricist_binary[is.na(lyricist_binary)] <- 0


songs$song_length_bucket <- cut2(songs$song_length, cuts = quantile(songs$song_length, probs = c(0,0.25,0.75)))
length_data <- songs[c('song_id', 'song_length_bucket')]
length_data <- na.omit(length_data)
length_binary <- dcast(length_data, formula = song_id~song_length_bucket, fun.aggregate = NULL)
length_binary[-1][!is.na(length_binary[-1])] <- 1
length_binary[is.na(length_binary)] <- 0
names(length_binary) <- c('song_id','low', 'med', 'high')

big_flat_item_features <- as.data.frame(cbind(length_binary, 
                                              genre_binary[2:ncol(genre_binary)],
                                              artists_binary[2:ncol(artists_binary)],
                                              composer_binary[2:ncol(composer_binary)],
                                              lyricist_binary[2:ncol(lyricist_binary)],
                                              lang_binary[2:ncol(lang_binary)]))
big_flat_item_features[2:ncol(big_flat_item_features)] <- sapply(big_flat_item_features[2:ncol(big_flat_item_features)], as.numeric)
big_flat_item_features <- readRDS('content_based/big_flat_item_features.rds')
big_flat_item_features[1:3,1:20]

We now have our items represented as feature vectors.

knn Classification Model

  • The nearest neighbor classifier is one of the simplest classification techniques, and it can be implemented in a relatively straightforward way.
  • We construct a model for each user based on his/her ratings for the items; a content-based recommender system can be thought of as a classification problem for each user.
total_prediction_matrix <- data.frame()
near_neighbors = 2^c(2:5)
for(active_user in user_key$msno){
  items_rated <- training_data$song_id[training_data$msno == active_user]
  train <- subset(big_flat_item_features, big_flat_item_features$song_id %in% items_rated)
  user_target <- subset(training_data[1:3], training_data$msno == active_user)
  user_target$target <- as.factor(user_target$target)
  train$target <- user_target$target[match(train$song_id, user_target$song_id)]
  train <- train[-1]          # drop song_id; remaining columns are the item features plus target
  train_knn <- train[1:2035]  # feature columns only
  cl <- train$target          # class labels for the knn classifier
  
  items_not_rated <- testing_data$song_id[testing_data$msno == active_user]
  test <- subset(big_flat_item_features, big_flat_item_features$song_id %in% items_not_rated)
  
  user_test_target <- subset(testing_data, testing_data$msno == active_user)
  test$target <- testing_data$target[match(test$song_id, testing_data$song_id)]
  test_knn = test[2:2036]  # feature columns only (song_id is column 1, target is last)
  for(nn in near_neighbors){
    preds <- knn(train_knn, test_knn, cl, k = nn, prob=TRUE)
    prediction_matrix <- as.data.frame(cbind(active_user, test$song_id, test$target, attributes(preds)$prob, as.numeric(as.character(preds)), nn))
    names(prediction_matrix) <- c('active_user', 'song_id', 'target', 'prediction_prob', 'prediction', 'nn')
    total_prediction_matrix <- rbind(total_prediction_matrix, prediction_matrix)
    
  }
  print(paste(active_user, "complete"))
}

The overall accuracy across all neighborhood sizes for the content-based recommender comes out to the following.

total_prediction_matrix <- read_rds('content_based/total_prediction_matrix_update.rds')
total_prediction_matrix$score <- as.numeric(as.character(total_prediction_matrix$prediction_prob))
total_prediction_matrix$prediction <- as.numeric(as.character(total_prediction_matrix$prediction))
total_prediction_matrix$score[total_prediction_matrix$prediction == 0] <- 1 - total_prediction_matrix$score[total_prediction_matrix$prediction == 0]
conf_matrix <- table(total_prediction_matrix$target, total_prediction_matrix$prediction)
content_based_accuracy <- (conf_matrix[1,1]+conf_matrix[2,2])/sum(conf_matrix)
content_based_accuracy
## [1] 0.6880569

We can see that increasing the number of neighbors in our knn content-based recommender improves the accuracy. The following chart summarizes the accuracy of the content-based recommender. We further discuss other accuracy measures in the conclusion.

accuracy_nn <- total_prediction_matrix[c('nn', 'target', 'prediction')]
accuracy_nn$diff <- abs(as.numeric(as.character(accuracy_nn$target)) - as.numeric(as.character(accuracy_nn$prediction)))
accuracy_nn <- accuracy_nn %>% group_by(nn, diff) %>% summarise(count = n())
accuracy_nn$accuracy <- accuracy_nn$count/mean(accuracy_nn$count)/2
accuracy_nn <- subset(accuracy_nn, diff == 0)
accuracy_nn

Evaluation

  • We compare the content-based model with the collaborative filtering models and see that it is better than random prediction and outperforms user-based collaborative filtering. However, the number of neighbors in the knn model was restricted to 32, so it underperforms the item-based collaborative filter.
h <- round(ibcf_accuracy,2)
i <- round(ubcf_accuracy,2)
j <- round(rand_accuracy,2)
nn_plot <- ggplot(data=accuracy_nn, aes(x=nn, y=accuracy, group=1)) +
  geom_line(size=1.2)+
  ggtitle('Accuracy of knn content-based recommender against number of \n nearest neighbors [CF for benchmarking]') +
  geom_hline(aes(yintercept=h), colour="#BB0000", linetype="dashed") + 
  geom_text(aes(0, h, label = paste("ibcf: ",h), vjust = 1, hjust = 0), size = 3) + 
  geom_hline(aes(yintercept=i), colour="#E69F00", linetype="dashed") + 
  geom_text(aes(0, i, label = paste("ubcf: ",i), vjust = 1, hjust = 0), size = 3) + 
  geom_hline(aes(yintercept=j), colour="#56B4E9", linetype="dashed") + 
  geom_text(aes(0, j, label = paste("rand: ",j), vjust = 1, hjust = 0), size = 3) + 
  geom_point(size = 1.5) + 
  ylim(0.62, 0.72)

nn_plot

SVD

Model Summary:

We also explore Singular Value Decomposition to build our model. The SVD algorithm is a matrix factorization technique used to reduce the number of features in a dataset. For a large \(M \times N\) user-item ratings matrix A of rank \(r\), the SVD model maps both users and items to a joint latent factor space of \(k\) singular values by decomposing the ratings matrix and reducing its dimensionality from \(r\) to \(k\). The latent factor space tries to explain ratings by characterizing both items and users on factors automatically inferred from user feedback.

The core formula for SVD matrix decomposition is:

\(A = U \Sigma V^T\)

where \(\Sigma\) is the diagonal matrix of rank \(k\) that contains the singular values (latent factors), and \(U\) and \(V\) contain the latent factor representations of the users and items, respectively.

Here’s an example of SVD dimensionality reduction with user movie ratings. To apply to our dataset, simply switch out movie genres with song categories.
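
As a stand-in illustration of that example, the rank-\(k\) idea can be reproduced with base R's svd() on a toy ratings matrix (hypothetical values):

# Toy 4 x 5 ratings matrix reconstructed from its top-2 singular values
A <- matrix(c(1, 0, 1, 1, 0,
              1, 1, 1, 0, 0,
              0, 0, 1, 1, 1,
              1, 1, 0, 0, 1), nrow = 4, byrow = TRUE)
s <- svd(A)
k <- 2
A_k <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])  # rank-k approximation of A
round(A_k, 2)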

The rule for predicting ratings \(r_{ui}\) for each user-item pair with SVD is :

\(\hat{r}_{ui} = \mu + b_i + b_u + q_i^T p_u\)

where each item \(i\) is represented by the vector \(q_i\) and each user \(u\) is associated with the vector \(p_u\), \(\mu\) is the average rating, and \(b_u\) and \(b_i\) are the user and item biases.

Our objective function for optimizing the model parameters minimizes the regularized squared error:

\(\min_{b_u, b_i, q_i, p_u} \sum_{(u,i)} \left(r_{ui} - \mu - b_i - b_u - q_i^T p_u\right)^2 + \lambda \left(b_i^2 + b_u^2 + \|q_i\|^2 + \|p_u\|^2\right)\)

Implementation using stochastic gradient descent (Funk SVD):

The implementation we use optimizes this objective with stochastic gradient descent. The method was first introduced by Simon Funk, hence the name Funk SVD. Since Funk SVD allows missing values in the ratings matrix, it is suitable for our dataset, as our matrix is quite sparse. Improving upon simple matrix decomposition, it lets us include the regularization term \(\lambda(b_i^2 + b_u^2 + \|q_i\|^2 + \|p_u\|^2)\) and control its strength via \(\lambda\). For each given rating \(r_{ui}\) in the training data, we make a prediction \(\hat{r}_{ui}\) with the reduced latent factors and calculate the prediction error \(e_{ui} = r_{ui} - \hat{r}_{ui}\). We update the parameters based on the prediction error as follows:

  • \(b_u \leftarrow b_u + \gamma (e_{ui} - \lambda b_u)\)

  • \(b_i \leftarrow b_i + \gamma (e_{ui} - \lambda b_i)\)

  • \(q_i \leftarrow q_i + \gamma (e_{ui} p_u - \lambda q_i)\)

  • \(p_u \leftarrow p_u + \gamma (e_{ui} q_i - \lambda p_u)\)

where \(\gamma\) is the learning rate and \(\lambda\) is the regularization parameter.
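
A minimal sketch of one SGD pass implementing these update rules on toy data is shown below (all values are hypothetical; the actual model is fit with recommenderlab's funkSVD):

# One SGD epoch over observed (user, item, rating) triples, with bias terms
ratings <- data.frame(u = c(1, 1, 2, 3), i = c(1, 2, 2, 1), r = c(1, 0, 1, 1))
n_users <- 3; n_items <- 2; k <- 2
gamma <- 0.015; lambda <- 0.001
mu  <- mean(ratings$r)
b_u <- rep(0, n_users); b_i <- rep(0, n_items)
P <- matrix(0.1, n_users, k)  # user latent factors p_u
Q <- matrix(0.1, n_items, k)  # item latent factors q_i
for (row in seq_len(nrow(ratings))) {
  u <- ratings$u[row]; i <- ratings$i[row]; r <- ratings$r[row]
  e <- r - (mu + b_u[u] + b_i[i] + sum(Q[i, ] * P[u, ]))  # prediction error e_ui
  b_u[u] <- b_u[u] + gamma * (e - lambda * b_u[u])
  b_i[i] <- b_i[i] + gamma * (e - lambda * b_i[i])
  q_old  <- Q[i, ]
  Q[i, ] <- Q[i, ] + gamma * (e * P[u, ] - lambda * Q[i, ])
  P[u, ] <- P[u, ] + gamma * (e * q_old - lambda * P[u, ])
}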

Our Model

First, we convert our data into an \(M \times N\) ratings matrix over all users and songs, so that each element \(r_{ui}\) represents whether user \(u\) will listen to song \(i\) again. Missing user-song pairs are set to NA. We also replace the known values of the original test data with NA so that we can make our own predictions; later on, we compare the model predictions with the actual values during model evaluation.

Then we split the matrix into training and testing with a train proportion of 0.8.

### Get train and test rating matrix
train_test_matrix <- function(training_data, testing_data, train_prop){
  
  #Replace known values
  testing <- testing_data
  testing[,3]<- rep(NA,nrow(testing))
  all_ratings <- data.frame(bind_rows(training_data[,1:3],testing))
  names(all_ratings) <- c('msno','song_id','target')
  
  #Convert to rating matrix
  rating_matrix <- dcast(all_ratings, msno ~ song_id ,value.var = "target" , na.rm=FALSE)
  rownames(rating_matrix) <- unlist(rating_matrix[,1])#row indexed by msno
  names(rating_matrix) <- names(rating_matrix) #column indexed by song_id
  rating_matrix <- rating_matrix[,-1] #column indexed by song_id
  
  #Split into train and test matrices
  which_split <- sample(x = c(TRUE, FALSE),size = nrow(rating_matrix),replace = TRUE,prob = c(train_prop, 1-train_prop))
  train_matrix <- as.matrix(rating_matrix[which_split, ],rownames.force = NA)
  test_matrix <- as.matrix(rating_matrix[!which_split, ],rownames.force = NA)
  
  return (list('rating_matrix'= rating_matrix, 'test_matrix'=test_matrix,'train_matrix'=train_matrix))
}

Then, we start training our models with the train matrix. For choosing the optimal parameter later, we use 5 different k values.

k= c(5,10,20,30,40)
gamma = 0.015 #learning rates
lambda = 0.001 #regularization

#Train Model with different k values
fsvd_model_5 <- funkSVD(train_matrix, k = 5, gamma = 0.015, lambda = 0.001,min_improvement = 1e-06, min_epochs = 50, max_epochs = 200, verbose = FALSE)
fsvd_model_10 <- funkSVD(train_matrix, k = 10, gamma = 0.015, lambda = 0.001,min_improvement = 1e-06, min_epochs = 50, max_epochs = 200, verbose = FALSE)
fsvd_model_20 <- funkSVD(train_matrix, k = 20, gamma = 0.015, lambda = 0.001,min_improvement = 1e-06, min_epochs = 50, max_epochs = 200, verbose = FALSE)
fsvd_model_30 <- funkSVD(train_matrix, k = 30, gamma = 0.015, lambda = 0.001,min_improvement = 1e-06, min_epochs = 50, max_epochs = 200, verbose = FALSE)
fsvd_model_40 <- funkSVD(train_matrix, k = 40, gamma = 0.015, lambda = 0.001,min_improvement = 1e-06, min_epochs = 50, max_epochs = 200, verbose = FALSE)
fsvd_models <- list(fsvd_model_5,fsvd_model_10,fsvd_model_20,fsvd_model_30,fsvd_model_40)

For each model, our prediction function outputs the training error (RMSE), probability predictions, and binarized predictions based on a probability threshold of 0.5.

### Predict real score and binary ratings using Funk SVD
predict_svd <-function(fsvd_model, train_matrix, test_matrix) {
  
  #Reconstruct the rating matrix as R = UV',get train error
  predicted_train_prob <- tcrossprod(fsvd_model$U, fsvd_model$V)
  predicted_train_binary <- ifelse(predicted_train_prob >=0.5,1,0)
  dimnames(predicted_train_prob) <- dimnames(train_matrix)
  dimnames(predicted_train_binary) <- dimnames(train_matrix)
  train_rmse <- RMSE(train_matrix, predicted_train_binary)
  
  #Predict ratings for test data with Funk SVD model
  predicted_test_prob <- predict(fsvd_model, test_matrix, verbose = TRUE)
  dimnames(predicted_test_prob) <- dimnames(test_matrix)
  predicted_test_binary <- ifelse(predicted_test_prob >=0.5,1,0)
  dimnames(predicted_test_binary) <- dimnames(test_matrix)
  
  return (list(predicted_train_prob, predicted_train_binary,train_rmse,
               predicted_test_prob,predicted_test_binary))
}
fsvd_model_5_result <- predict_svd(fsvd_model_5, train_matrix, test_matrix)
fsvd_model_10_result <- predict_svd(fsvd_model_10, train_matrix, test_matrix)
fsvd_model_20_result <- predict_svd(fsvd_model_20, train_matrix, test_matrix)
fsvd_model_30_result <- predict_svd(fsvd_model_30, train_matrix, test_matrix)
fsvd_model_40_result <- predict_svd(fsvd_model_40, train_matrix, test_matrix)
fsvd_models_results <- list(fsvd_model_5_result,fsvd_model_10_result,fsvd_model_20_result,fsvd_model_30_result,fsvd_model_40_result)

Tuning Parameter

Latent factor K

We test the SVD model with different numbers of latent factors \((k = 5, 10, 20, 30, 40)\) and compare their training RMSE to choose the optimal one. Intuitively, more latent factors will output more accurate results, but we also take into account the training and prediction time, which varies significantly with different k values.

rmse_all <- c('k=5'= fsvd_models_results[[1]][[3]],'k=10'= fsvd_models_results[[2]][[3]],'k=20'= fsvd_models_results[[3]][[3]],'k=30'= fsvd_models_results[[4]][[3]],'k=40'= fsvd_models_results[[5]][[3]])
rmse_all
##       k=5      k=10      k=20      k=30      k=40 
## 0.5331391 0.5225738 0.5220137 0.5215471 0.5210543

From the RMSE and prediction accuracy, we can see that larger values of k do not improve the model significantly. Therefore, we choose the optimal value of k to be 10.

Prediction Results

Sample output:

head(fsvd_test_results[,1:5],10)

Evaluation

Accuracy for testing data and evaluation results

#Prediction Accuracy
err_10 =0
for (k in 1:nrow(fsvd_test_results)){
  err_10 = err_10 + abs(fsvd_test_results$FunkSVD_results_10[k] - fsvd_test_results$target[k])}
1-(err_10/nrow(fsvd_test_results))
## [1] 0.7097649
evaluation_result[[1]] #print evaluation metrics
evaluation_result[[2]] #print confusion matrix
##          actual
## predicted      0      1
##         0  42451  24657
##         1  42661 122174
svd.roc.plot + ggtitle('ROC Curve for SVD Model') +
  theme(plot.title = element_text(hjust = 0.5))

The final accuracy for the SVD approximation with k = 10 is 0.71. We will examine the details of the evaluation output when we compare this model to the others in the conclusion.

Hybridization

Data Preparation

load('FM/FM.RData')
user_key <- readRDS('content_based/user_key.rds')
user_key$user_code <- as.character(user_key$user_code)
user_key$msno <- as.character(user_key$msno)
song_key <- readRDS('content_based/song_key.rds')
song_key$song_code <- as.character(song_key$song_code)
song_key$song_id <- as.character(song_key$song_id)
merged_testing$msno <- as.character(merged_testing$msno)
merged_testing$song_id <- as.character(merged_testing$song_id)
merged_testing$user_key <- user_key$msno[match(merged_testing$msno, user_key$user_code)]
merged_testing$song_key <- song_key$song_id[match(merged_testing$song_id, song_key$song_code)]
merged_testing <- cbind(merged_testing, predFM.mcmc)
fm_predictions <- cbind(merged_testing[c('user_key','song_key','target')],predFM.mcmc)
fm_predictions$fm_prediction_class <- 0
fm_predictions$fm_prediction_class[fm_predictions$predFM.mcmc >=0.5] <- 1
fm_predictions <- fm_predictions[c('user_key', 'song_key', 'target', 'predFM.mcmc', 'fm_prediction_class')]
names(fm_predictions) <- c('msno', 'song_id','target',  'predicted_fm_score', 'predicted_fm_class')

svd_model <- readRDS('SVD/FSVD_info/fsvd_test_results.rds')
svd_model$song_id <- as.character(svd_model$song_id)
svd_model$msno <- as.character(svd_model$msno)
svd_model$user_key <- user_key$msno[match(svd_model$msno, user_key$user_code)]
svd_model$song_key <- song_key$song_id[match(svd_model$song_id, song_key$song_code)]
svd_predictions <- svd_model[,c('user_key', 'song_key', 'FunkSVD_scores_10', 'FunkSVD_results_10')]
names(svd_predictions) <- c('msno', 'song_id', 'predicted_svd_score', 'predicted_svd_class')

flat_table <- inner_join(merged_testing, svd_model, by = c("msno", "song_id"))
flat_table <- flat_table[c("msno", "song_id", "target.x", "predFM.mcmc", "FunkSVD_results_10", "FunkSVD_scores_10")]
names(flat_table) <- c('msno', 'song_id', 'target', 'pred_FM_score','pred_SVD_class','pred_SVD_score')
flat_table$target <- as.factor(flat_table$target)
flat_table$pred_FM_class <- 0
flat_table$pred_FM_class[flat_table$pred_FM_score >= 0.5] <- 1

set.seed(666)
split_data <- train_test_split(flat_table,0.75)
train_hybrid <- split_data$train_data
test_hybrid <- split_data$test_data
train_hybrid_feed <- train_hybrid[c("target","pred_FM_score","pred_SVD_score")]
test_hybrid_feed <- test_hybrid[c("pred_FM_score","pred_SVD_score")]

Now our data looks like the following table:

head(flat_table)

Parameter Tuning

We train our neural network model over various hidden-layer sizes and decay rates to find the best parameters.

nn_size <- c(25,50,100)
decay_rate <- c(0,0.01, 0.001)
accuracy <- data.frame()
for(nn in nn_size){
  for(dr in decay_rate){
    print(paste('size:', nn, 'decay:', dr))  # progress indicator
    nn_hybrid <- nnet(target ~. , data=train_hybrid_feed, size = nn, decay = dr, maxit = 500)
    train_hybrid_feed$predicted_values <- as.vector(nn_hybrid$fitted.values)
    train_hybrid_feed$predicted_class <- 0
    train_hybrid_feed$predicted_class[train_hybrid_feed$predicted_values >= 0.5] <- 1
    conf_matrix <- table(train_hybrid_feed$target, train_hybrid_feed$predicted_class)
    train_hybrid_accuracy <- (conf_matrix[1,1]+conf_matrix[2,2])/sum(conf_matrix)
    train_append <- c(nn, dr, train_hybrid_accuracy)
    accuracy <- rbind(accuracy, train_append)
  }
}
names(accuracy) <- c("size", "decay_rate","train_error")

We can see the training accuracy for our grid search and test our model with the best parameters, i.e., those that maximize the training accuracy.

accuracy$size <- as.numeric(accuracy$size)
accuracy$decay_rate <- as.numeric(accuracy$decay_rate)
accuracy$train_error <- as.numeric(accuracy$train_error)
accuracy_mat <- dcast(accuracy, size~decay_rate, value.var = 'train_error')
accuracy_mat

Accuracy

We compare the accuracy against the factorization machine output and SVD output and see that there is slight improvement in the accuracy for the neural net hybrid recommender system. We further discuss our results in the next section.

nn_hybrid <- nnet(target ~. , data=train_hybrid_feed, size = 100, decay = 0.01, maxit = 500)
test_hybrid$pred_hybrid <- predict(nn_hybrid, test_hybrid_feed)
test_hybrid$hybrid_class <- 0
test_hybrid$hybrid_class[test_hybrid$pred_hybrid >= 0.5] <- 1
#saveRDS(test_hybrid, 'test_hybrid.rds')
conf_matrix <- table(test_hybrid$target, test_hybrid$hybrid_class)
hybrid_accuracy <- (conf_matrix[1,1]+conf_matrix[2,2])/sum(conf_matrix)

conf_matrix <- table(test_hybrid$target, test_hybrid$pred_SVD_class)
svd_accuracy <- (conf_matrix[1,1]+conf_matrix[2,2])/sum(conf_matrix)

conf_matrix <- table(test_hybrid$target, test_hybrid$pred_FM_class)
fm_accuracy <- (conf_matrix[1,1]+conf_matrix[2,2])/sum(conf_matrix)
paste('hybrid:', hybrid_accuracy)
## [1] "hybrid: 0.748404787362467"
paste("fm:", fm_accuracy)
## [1] "fm: 0.748215086400166"
paste("svd:", svd_accuracy)
## [1] "svd: 0.711482081881834"

Conclusion

Model Comparison

  • The following table summarizes the model accuracy metrics for all the recommender systems we built in the project.
evaluate2 <- function(pred.target, target){
  accuracy <- sum(pred.target == target)/length(target)
  conf.mat <- table(pred.target, target, dnn = c("predicted","actual"))
  precision <- conf.mat[2,2]/sum(conf.mat[2,])
  recall <- conf.mat[2,2]/sum(conf.mat[,2])
  F1 <- 2 * (precision*recall/(precision + recall))
  eval.metrics <- data.frame(cbind(accuracy, precision, recall, F1))
  return(eval.metrics)
}

all_models_eval <- rbind(
  evaluate2(test_hybrid$hybrid_class, test_hybrid$target),
  FM.eval[[1]][c(1,3:5)], 
  evaluate2(total_prediction_matrix[total_prediction_matrix$nn == 128, 'prediction'], 
            total_prediction_matrix[total_prediction_matrix$nn == 128, 'target']), 
  evaluation_result[[1]][c(1,3:5)], #svd
  evaluate2(results_ibcf$predicted_class, results_ibcf$target),
  evaluate2(results_ubcf$predicted_class, results_ubcf$target)
)
model <- c('Hybrid', 'FM', 'Content', 'SVD', 'IBCF', 'UBCF')

all_models_eval  <- cbind(model, all_models_eval)
all_models_eval

Discussion

  • We also hybridized the Factorization Machine and SVD models, using a neural network, to improve the overall accuracy of the recommender.
  • The models are summarized in the table above, and we see that the Factorization Machine model outperformed every other single model. We compared the models against the random accuracy [predicting 1 for every user-song pair, which translates to recommending all songs to every user] and see that all of the algorithms are better than random prediction.
  • In the content-based recommender, we see that increasing the number of nearest neighbors in knn increased the accuracy of the model, bringing it just above the SVD performance. However, the cost of building a content-based recommender is very high, as each user has his/her own model built from the features of the songs he/she prefers.
  • Factorization machines have high recall but lack the precision of some of the other models. We also observe that the model performs better on songs that occur more times in the data, so future work could involve improving the accuracy on less popular songs. Additionally, we could explore field-aware factorization machines.
  • Upon hybridization, we can observe that the accuracy improves slightly at the cost of recall. The hybrid model has better precision, but lower recall compared to the FM Model.
  • In this study, our objective is to predict repeat listening events for user song pairs. Since we are not recommending items, per se, the concepts of novelty and serendipity are not applicable to our predictions. However, if our objective was instead to recommend songs to users, then we could update our content-based model to make novel recommendations by assigning different weights to variables in the item-feature space. This would allow us to construct new features which are a combination of likeable features for the particular user. This would be a good path to explore in future work.