Synopsis

Deezer, a music streaming app, proposed an online challenge: https://inclass.kaggle.com/c/dsg17-online-phase/. The objective of this challenge is to improve music recommendation by building a prediction model that indicates whether each user in the test dataset listened to the first track that Flow proposed to them.

In this study, we provide a description of our potential approach in order to answer the following questions:

What would you do to win this competition?
What would you do to solve the problem they are interested in?
Why might the two solutions overlap, or why might they not?

Introduction

In order to answer the questions mentioned above, our approach is:

To win this competition, the model that we build only needs to predict well on the provided test set in test.csv.
To solve the problem, we need to build a predictive model which generalizes well and can correctly predict any unseen observation.

Cleaning and exploration of the data

We start by summarizing the training and testing data:

mydataTrain <- read.csv("train.csv")
summary(mydataTrain)

##     genre_id        ts_listen            media_id        
##  Min.   :     0   Min.   :1.000e+00   Min.   :   200058  
##  1st Qu.:     0   1st Qu.:1.478e+09   1st Qu.: 13766137  
##  Median :     3   Median :1.479e+09   Median : 93806596  
##  Mean   :  2245   Mean   :1.479e+09   Mean   : 78396237  
##  3rd Qu.:    27   3rd Qu.:1.480e+09   3rd Qu.:126259195  
##  Max.   :259731   Max.   :1.481e+09   Max.   :137260128  
##     album_id         context_type    release_date      platform_name   
##  Min.   :    1976   Min.   : 0.00   Min.   :19000101   Min.   :0.0000  
##  1st Qu.: 1255566   1st Qu.: 0.00   1st Qu.:20091231   1st Qu.:0.0000  
##  Median : 9525626   Median : 1.00   Median :20141031   Median :0.0000  
##  Mean   : 8136169   Mean   : 2.36   Mean   :20113878   Mean   :0.4732  
##  3rd Qu.:13292211   3rd Qu.: 2.00   3rd Qu.:20160607   3rd Qu.:1.0000  
##  Max.   :14720858   Max.   :73.00   Max.   :30000101   Max.   :2.0000  
##  platform_family  media_duration     listen_type      user_gender    
##  Min.   :0.0000   Min.   :    0.0   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:  196.0   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :  222.0   Median :0.0000   Median :0.0000  
##  Mean   :0.2558   Mean   :  231.2   Mean   :0.3069   Mean   :0.3937  
##  3rd Qu.:0.0000   3rd Qu.:  254.0   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :2.0000   Max.   :65535.0   Max.   :1.0000   Max.   :1.0000  
##     user_id        artist_id           user_age      is_listened   
##  Min.   :    0   Min.   :       1   Min.   :18.00   Min.   :0.000  
##  1st Qu.:  899   1st Qu.:    2605   1st Qu.:21.00   1st Qu.:0.000  
##  Median : 2738   Median :  194172   Median :25.00   Median :1.000  
##  Mean   : 4037   Mean   : 1500740   Mean   :24.31   Mean   :0.684  
##  3rd Qu.: 6064   3rd Qu.: 1519461   3rd Qu.:28.00   3rd Qu.:1.000  
##  Max.   :19917   Max.   :11447410   Max.   :30.00   Max.   :1.000

mydataTest <- read.csv("test.csv")
summary(mydataTest)

##    sample_id        genre_id        ts_listen            media_id        
##  Min.   :    0   Min.   :     0   Min.   :1.474e+09   Min.   :   208864  
##  1st Qu.: 4979   1st Qu.:     0   1st Qu.:1.479e+09   1st Qu.: 12494287  
##  Median : 9958   Median :     7   Median :1.480e+09   Median : 84791735  
##  Mean   : 9958   Mean   :  1827   Mean   :1.480e+09   Mean   : 73913024  
##  3rd Qu.:14938   3rd Qu.:    27   3rd Qu.:1.480e+09   3rd Qu.:121728646  
##  Max.   :19917   Max.   :194411   Max.   :1.481e+09   Max.   :136932296  
##     album_id         context_type     release_date      platform_name   
##  Min.   :    7782   Min.   : 1.000   Min.   :19000101   Min.   :0.0000  
##  1st Qu.: 1132614   1st Qu.: 1.000   1st Qu.:20091204   1st Qu.:0.0000  
##  Median : 8543341   Median : 1.000   Median :20140616   Median :0.0000  
##  Mean   : 7632604   Mean   : 2.112   Mean   :20112467   Mean   :0.4444  
##  3rd Qu.:12721422   3rd Qu.: 1.000   3rd Qu.:20160324   3rd Qu.:1.0000  
##  Max.   :14657320   Max.   :23.000   Max.   :20161231   Max.   :2.0000  
##  platform_family media_duration  listen_type      user_gender    
##  Min.   :0.000   Min.   :  25   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.000   1st Qu.: 198   1st Qu.:1.0000   1st Qu.:0.0000  
##  Median :0.000   Median : 223   Median :1.0000   Median :0.0000  
##  Mean   :0.179   Mean   : 233   Mean   :0.9999   Mean   :0.4332  
##  3rd Qu.:0.000   3rd Qu.: 254   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :2.000   Max.   :4126   Max.   :1.0000   Max.   :1.0000  
##     user_id        artist_id           user_age    
##  Min.   :    0   Min.   :       1   Min.   :18.00  
##  1st Qu.: 4979   1st Qu.:    2350   1st Qu.:20.00  
##  Median : 9958   Median :  164704   Median :24.00  
##  Mean   : 9958   Mean   : 1396082   Mean   :23.85  
##  3rd Qu.:14938   3rd Qu.: 1429782   3rd Qu.:27.00  
##  Max.   :19917   Max.   :11364782   Max.   :30.00

The release_date variable in the training set has a maximum of 30000101, which is not a plausible date. We therefore keep only the training rows whose release_date does not exceed the maximum observed in the testing data (20161231).

mydataTrain <- mydataTrain[!(mydataTrain$release_date > 20161231),]

In the rest of this section, we describe suggestions that apply only to question 1.

For the listen_type variable in the testing set, which takes the values 0 or 1, the first quartile and the maximum are both equal to 1.

sum(mydataTest$listen_type == 0)

## [1] 1

This shows that the testing set contains only one observation with listen_type = 0. Removing all training rows with listen_type = 0 would therefore give a simpler model to train, focused on predicting the provided test set. However, before removing such a row, we need to check whether its user_id also appears with listen_type = 1. If it does not, we keep that user's rows with listen_type = 0 so that every user retains at least some tracks; the list of distinct users in the training set matches that of the testing set exactly.

This last point could be used when we try to win the competition.
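The filtering rule described above can be sketched as follows. This is a minimal illustration on a toy data frame, not the real train.csv; the helper name `keep_type1_rows` is hypothetical and introduced only for this example.

```r
# Toy stand-in for train.csv; only user_id and listen_type matter here.
toy_train <- data.frame(
  user_id     = c(1, 1, 2, 2, 3),
  listen_type = c(1, 0, 0, 0, 1)
)

# Hypothetical helper: drop rows with listen_type == 0, except for users
# who have no row with listen_type == 1 at all (they would otherwise vanish).
keep_type1_rows <- function(df) {
  users_with_type1 <- unique(df$user_id[df$listen_type == 1])
  keep <- df$listen_type == 1 | !(df$user_id %in% users_with_type1)
  df[keep, , drop = FALSE]
}

filtered <- keep_type1_rows(toy_train)

# Every user of the original data should still be present after filtering.
all_users_kept <- setequal(unique(filtered$user_id), unique(toy_train$user_id))
```

In this toy example, user 2 has only listen_type = 0 rows, so those rows are kept, while the listen_type = 0 row of user 1 is dropped.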

Another bias observed in the data concerns the variable ts_listen, which is the timestamp of the listening event in UNIX time:

# Convert Any Input to Parsed Date or Datetime
library(anytime)

anytime(min(mydataTrain$ts_listen))

## [1] "1969-12-31 19:00:01 EST"

anytime(max(mydataTrain$ts_listen))

## [1] "2016-12-01 18:44:05 EST"

anytime(min(mydataTest$ts_listen))

## [1] "2016-09-12 07:49:59 EDT"

anytime(max(mydataTest$ts_listen))

## [1] "2016-12-01 18:55:31 EST"

As shown above, ts_listen spans a much wider range in the training set (starting near the UNIX epoch) than in the testing set, which covers only about three months. We can keep only the training data that falls within a few years before the maximum ts_listen of the testing set (2016-12-01). We must also make sure that each user_id still exists in both the training and testing sets.

This last point could be used when we try to win the competition.
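A minimal sketch of this time-window filter, on toy data rather than the real files. The window length `window_years` is an assumed value chosen only for illustration; the text above says "a few years" without fixing a number.

```r
# Toy data; ts_listen is in seconds since the UNIX epoch.
toy_train <- data.frame(
  user_id   = c(1, 1, 2, 3),
  ts_listen = c(1, 1478000000, 1479000000, 1300000000)
)
toy_test <- data.frame(
  user_id   = c(1, 2, 3),
  ts_listen = c(1479500000, 1480000000, 1480500000)
)

# Assumed window length (illustrative only).
window_years <- 2
cutoff <- max(toy_test$ts_listen) - window_years * 365 * 24 * 3600

# Keep only training rows inside the window.
recent_train <- toy_train[toy_train$ts_listen >= cutoff, , drop = FALSE]

# Users of the test set that no longer appear in the filtered training set.
missing_users <- setdiff(unique(toy_test$user_id), unique(recent_train$user_id))
```

Here user 3 only has an old listening event, so it ends up in `missing_users`. A non-empty result like this signals that the window must be widened or that those users' rows must be kept regardless of their timestamp.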

Prediction

To win this competition, we could use a random forest, which is an ensemble learning model. It is difficult to interpret and slow to train, but it is often very accurate; indeed, it is one of the most widely used models in Kaggle-style competitions.

To fit a random forest, one can use the caret package in R or the scikit-learn package in Python, both of which provide random forest models.
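As a rough sketch of what a caret-based fit might look like, the example below trains a small random forest on synthetic data. The data, the chosen predictors, and the settings (`ntree = 50`, `mtry = 2`, 3-fold cross-validation) are all assumptions made for illustration, not tuned choices for the Deezer data. The fit is skipped if caret or randomForest is not installed.

```r
set.seed(42)

# Synthetic stand-in for the training data (not the real Deezer data).
n <- 200
toy <- data.frame(
  user_age       = sample(18:30, n, replace = TRUE),
  media_duration = round(runif(n, 120, 400)),
  listen_type    = rbinom(n, 1, 0.3)
)
# Outcome loosely tied to listen_type so there is some signal to learn.
toy$is_listened <- factor(
  ifelse(runif(n) < 0.4 + 0.4 * toy$listen_type, "listened", "skipped")
)

rf_fit <- NULL
if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("randomForest", quietly = TRUE)) {
  library(caret)
  rf_fit <- train(
    is_listened ~ user_age + media_duration + listen_type,
    data      = toy,
    method    = "rf",                   # random forest (randomForest backend)
    ntree     = 50,                     # passed through to randomForest()
    tuneGrid  = data.frame(mtry = 2),   # fixed mtry, no tuning grid search
    trControl = trainControl(method = "cv", number = 3)
  )
  # Class predictions on the same toy data, only to show the call.
  preds <- predict(rf_fit, newdata = toy)
}
```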

In order to solve the problem, the prediction should generalize well to any unseen observation at any time. One approach that could be used here is to shuffle the training set and split it into three smaller sets: training, validation, and testing. We train the model on the training set and then evaluate it on the validation set. If we are not satisfied with the accuracy on the validation set, we tweak the model according to those results and repeat the training phase. Once we reach the desired accuracy, we confirm it on the testing set.
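The three-way split described above can be sketched in base R as follows. The 60/20/20 proportions and the toy data frame are assumptions for illustration; the text does not prescribe specific proportions.

```r
set.seed(1)

# Toy stand-in for the training data.
toy <- data.frame(id = 1:100, is_listened = rbinom(100, 1, 0.68))

# Shuffle the rows, then cut into three consecutive blocks.
shuffled <- toy[sample(nrow(toy)), ]
n        <- nrow(shuffled)
n_train  <- round(0.6 * n)   # assumed 60% for training
n_valid  <- round(0.2 * n)   # assumed 20% for validation

train_set <- shuffled[1:n_train, ]
valid_set <- shuffled[(n_train + 1):(n_train + n_valid), ]
test_set  <- shuffled[(n_train + n_valid + 1):n, ]   # remaining rows
```

The model would be fitted on `train_set`, tuned against `valid_set`, and evaluated once on `test_set` at the end.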

Using a random forest could be computationally very expensive for generalization because the training set contains a very large number of observations (more than 7.5 million). Other prediction models could be used here, such as artificial neural networks built with TensorFlow (https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf), which is suitable for large-scale machine learning.

Conclusion

This report provides first suggestions for tackling the proposed Kaggle challenge. The main difference between the solutions to questions 1 and 2 lies in the generalization of the predictive model. To win the competition, we simplify the problem based on the distribution of the testing set. This solution would probably not generalize well if a test set with a different distribution were used. To the best of our knowledge, Kaggle evaluates the performance of the model on the provided testing set only. In order to generalize the model (question 2), additional steps would be necessary, such as splitting the training set into three smaller sets for training, validation, and testing. Because of the very large number of training observations in this case, predictive models that scale well with the number of training observations are needed.

In our opinion, the two solutions would not overlap in our case because:

- In question 1, we adapt the model to the distribution of the testing set, which allows some simplifications.
- In question 2, we adapt the model so that it generalizes. More effort is needed in this case.

Further investigation of the data could be done, for example, identifying correlations between the variables, if any exist. Other data manipulations could also be applied, especially for question 1, for example, weighting the observations in the training set instead of removing them.
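As one possible reading of the weighting idea, the sketch below down-weights rows with listen_type = 0 instead of dropping them and passes the weights to a logistic-type regression. The weight value 0.1, the choice of glm, and the toy data are all assumptions for illustration; the report does not specify a weighting scheme or a model for it.

```r
set.seed(7)

# Synthetic stand-in for the training data (not the real Deezer data).
n <- 300
toy <- data.frame(
  user_age    = sample(18:30, n, replace = TRUE),
  listen_type = rbinom(n, 1, 0.3),
  is_listened = rbinom(n, 1, 0.68)
)

# Assumed weights: full weight for listen_type == 1, reduced weight otherwise.
toy$w <- ifelse(toy$listen_type == 1, 1, 0.1)

# quasibinomial avoids the non-integer-weights warning of family = binomial.
weighted_fit <- glm(
  is_listened ~ user_age,
  data    = toy,
  weights = w,
  family  = quasibinomial()
)
```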

Exercise

Mohamed Amine Bouhlel

3/20/2019
