Deezer, a music streaming app, proposed an online challenge (https://inclass.kaggle.com/c/dsg17-online-phase/). The objective of this challenge is to improve music recommendation by building a prediction model that indicates, for each user in the test dataset, whether or not they listened to the first track that Flow proposed to them.
In this study, we provide a description of our potential approach in order to answer the following questions:
What would you do to win this competition?
What would you do to solve the problem they are interested in?
Why might the two solutions not overlap, or why might they?
To answer these questions, we proceed as follows.
We start by summarizing the training and testing data:
mydataTrain = read.csv("train.csv")
summary(mydataTrain)
## genre_id ts_listen media_id
## Min. : 0 Min. :1.000e+00 Min. : 200058
## 1st Qu.: 0 1st Qu.:1.478e+09 1st Qu.: 13766137
## Median : 3 Median :1.479e+09 Median : 93806596
## Mean : 2245 Mean :1.479e+09 Mean : 78396237
## 3rd Qu.: 27 3rd Qu.:1.480e+09 3rd Qu.:126259195
## Max. :259731 Max. :1.481e+09 Max. :137260128
## album_id context_type release_date platform_name
## Min. : 1976 Min. : 0.00 Min. :19000101 Min. :0.0000
## 1st Qu.: 1255566 1st Qu.: 0.00 1st Qu.:20091231 1st Qu.:0.0000
## Median : 9525626 Median : 1.00 Median :20141031 Median :0.0000
## Mean : 8136169 Mean : 2.36 Mean :20113878 Mean :0.4732
## 3rd Qu.:13292211 3rd Qu.: 2.00 3rd Qu.:20160607 3rd Qu.:1.0000
## Max. :14720858 Max. :73.00 Max. :30000101 Max. :2.0000
## platform_family media_duration listen_type user_gender
## Min. :0.0000 Min. : 0.0 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.: 196.0 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median : 222.0 Median :0.0000 Median :0.0000
## Mean :0.2558 Mean : 231.2 Mean :0.3069 Mean :0.3937
## 3rd Qu.:0.0000 3rd Qu.: 254.0 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :2.0000 Max. :65535.0 Max. :1.0000 Max. :1.0000
## user_id artist_id user_age is_listened
## Min. : 0 Min. : 1 Min. :18.00 Min. :0.000
## 1st Qu.: 899 1st Qu.: 2605 1st Qu.:21.00 1st Qu.:0.000
## Median : 2738 Median : 194172 Median :25.00 Median :1.000
## Mean : 4037 Mean : 1500740 Mean :24.31 Mean :0.684
## 3rd Qu.: 6064 3rd Qu.: 1519461 3rd Qu.:28.00 3rd Qu.:1.000
## Max. :19917 Max. :11447410 Max. :30.00 Max. :1.000
mydataTest = read.csv("test.csv")
summary(mydataTest)
## sample_id genre_id ts_listen media_id
## Min. : 0 Min. : 0 Min. :1.474e+09 Min. : 208864
## 1st Qu.: 4979 1st Qu.: 0 1st Qu.:1.479e+09 1st Qu.: 12494287
## Median : 9958 Median : 7 Median :1.480e+09 Median : 84791735
## Mean : 9958 Mean : 1827 Mean :1.480e+09 Mean : 73913024
## 3rd Qu.:14938 3rd Qu.: 27 3rd Qu.:1.480e+09 3rd Qu.:121728646
## Max. :19917 Max. :194411 Max. :1.481e+09 Max. :136932296
## album_id context_type release_date platform_name
## Min. : 7782 Min. : 1.000 Min. :19000101 Min. :0.0000
## 1st Qu.: 1132614 1st Qu.: 1.000 1st Qu.:20091204 1st Qu.:0.0000
## Median : 8543341 Median : 1.000 Median :20140616 Median :0.0000
## Mean : 7632604 Mean : 2.112 Mean :20112467 Mean :0.4444
## 3rd Qu.:12721422 3rd Qu.: 1.000 3rd Qu.:20160324 3rd Qu.:1.0000
## Max. :14657320 Max. :23.000 Max. :20161231 Max. :2.0000
## platform_family media_duration listen_type user_gender
## Min. :0.000 Min. : 25 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.: 198 1st Qu.:1.0000 1st Qu.:0.0000
## Median :0.000 Median : 223 Median :1.0000 Median :0.0000
## Mean :0.179 Mean : 233 Mean :0.9999 Mean :0.4332
## 3rd Qu.:0.000 3rd Qu.: 254 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :2.000 Max. :4126 Max. :1.0000 Max. :1.0000
## user_id artist_id user_age
## Min. : 0 Min. : 1 Min. :18.00
## 1st Qu.: 4979 1st Qu.: 2350 1st Qu.:20.00
## Median : 9958 Median : 164704 Median :24.00
## Mean : 9958 Mean : 1396082 Mean :23.85
## 3rd Qu.:14938 3rd Qu.: 1429782 3rd Qu.:27.00
## Max. :19917 Max. :11364782 Max. :30.00
The release_date variable in the training set has a maximum value of 30000101 (1 January 3000), which is clearly invalid. We therefore keep only the training rows whose release_date does not exceed the maximum release_date in the testing data (20161231).
mydataTrain <- mydataTrain[!(mydataTrain$release_date > 20161231),]
The rest of this section describes suggestions for answering question 1 only.
For the listen_type variable in the testing set, which takes the values 0 or 1, the first quartile and the maximum are both equal to 1, so almost every observation has listen_type = 1.
sum(mydataTest$listen_type == 0)
## [1] 1
This shows that the testing set contains only 1 observation with listen_type = 0. Removing all training rows with listen_type = 0 would therefore give a simpler model to train, one focused on predicting the proposed test set. Before removing such a row, however, we need to check whether its user_id also appears with listen_type = 1. If it does not, we keep that user's rows with listen_type = 0 so that every user retains at least some listening history; the list of distinct users in the training dataset matches that of the test dataset exactly.
This last point could be exploited when trying to win the competition.
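The filtering rule described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the helper name keep_flow_rows and the toy data frame are hypothetical, and the function would be applied to mydataTrain in practice.

```r
# Hypothetical helper (not part of the original analysis): keep every row with
# listen_type == 1, plus all rows of users who never appear with listen_type == 1,
# so that no user_id disappears from the training data.
keep_flow_rows <- function(train) {
  users_with_flow <- unique(train$user_id[train$listen_type == 1])
  keep <- train$listen_type == 1 | !(train$user_id %in% users_with_flow)
  train[keep, , drop = FALSE]
}

# Toy data for illustration only: user 2 has no listen_type == 1 rows.
toy <- data.frame(
  user_id     = c(1, 1, 2, 2, 3),
  listen_type = c(1, 0, 0, 0, 1),
  is_listened = c(1, 0, 1, 0, 1)
)
filtered <- keep_flow_rows(toy)
```

On the real data, the call would be keep_flow_rows(mydataTrain); comparing unique(mydataTrain$user_id) before and after is a quick check that no user was lost.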
Another bias in the data concerns the variable ts_listen, which is the timestamp of the listening event in UNIX time:
# Convert Any Input to Parsed Date or Datetime
library(anytime)
anytime(min(mydataTrain$ts_listen))
## [1] "1969-12-31 19:00:01 EST"
anytime(max(mydataTrain$ts_listen))
## [1] "2016-12-01 18:44:05 EST"
anytime(min(mydataTest$ts_listen))
## [1] "2016-09-12 07:49:59 EDT"
anytime(max(mydataTest$ts_listen))
## [1] "2016-12-01 18:55:31 EST"
As we can see above, the range of ts_listen in the training set is much larger than in the testing set, which covers only about three months. The training minimum (ts_listen = 1, the very start of the UNIX epoch) is clearly not a real listening time. We can therefore keep only the training data from the few years preceding the maximum ts_listen in the testing set (2016-12-01). We also make sure that each user_id exists in both the training and testing sets.
This last point could also be exploited when trying to win the competition.
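A minimal base R sketch of this time-window restriction is given below. The function name restrict_to_window, the default window of two years, and the choice to re-add all rows of users who would otherwise vanish are assumptions made for illustration; the text above does not fix any of them.

```r
# Hypothetical helper: keep training rows from the `years` preceding the latest
# test timestamp, then restore rows of any test user left without training data.
restrict_to_window <- function(train, test, years = 2) {
  seconds_per_year <- 365 * 24 * 3600          # approximation, ignores leap years
  cutoff <- max(test$ts_listen) - years * seconds_per_year
  recent <- train[train$ts_listen >= cutoff, , drop = FALSE]
  # Users present in the test set but absent from the restricted training set
  missing_users <- setdiff(unique(test$user_id), unique(recent$user_id))
  rescued <- train[train$user_id %in% missing_users, , drop = FALSE]
  rbind(recent, rescued)
}

# Toy data for illustration only (timestamps in UNIX seconds).
toy_train <- data.frame(
  user_id   = c(1, 1, 2, 3),
  ts_listen = c(1, 1479000000, 1300000000, 1480500000)
)
toy_test <- data.frame(
  user_id   = c(1, 2),
  ts_listen = c(1480000000, 1480600000)
)
windowed <- restrict_to_window(toy_train, toy_test, years = 2)
```

In the toy example, the row with ts_listen = 1 is dropped, while the old row of user 2 is kept because that user would otherwise have no training data at all.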
A random forest, which is an ensemble learning model, could be used to try to win this competition. Such a model is harder to interpret and slower to train than a single decision tree, but it is usually very accurate. It is one of the most widely used models for prediction in competitions such as those hosted on Kaggle.
To train a random forest, one can use the caret package in R, which provides a common interface to implementations such as the randomForest package, or the scikit-learn package in Python.
To solve the underlying problem, the prediction should generalize well to any unseen observation at any time. One approach is to shuffle the training set and split it into three smaller sets: training, validation, and testing. We first train the model on the training set and then evaluate it on the validation set. If the validation accuracy is not satisfactory, we adjust the model according to those results and repeat the training phase. Once the desired accuracy is reached, we confirm it on the held-out testing set.
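The three-way split described above could be sketched in base R as follows. The function name split_three_way, the 60/20/20 proportions, and the seed are illustrative assumptions rather than choices made in this study.

```r
# Hypothetical helper: shuffle the rows, then cut them into training,
# validation, and testing subsets that do not overlap.
split_three_way <- function(data, train_frac = 0.6, valid_frac = 0.2, seed = 42) {
  stopifnot(train_frac > 0, valid_frac > 0, train_frac + valid_frac < 1)
  set.seed(seed)                       # fixes the shuffle so it is reproducible
  n <- nrow(data)
  shuffled <- sample.int(n)            # random permutation of the row indices
  n_train <- round(train_frac * n)
  n_valid <- round(valid_frac * n)
  list(
    train = data[shuffled[seq_len(n_train)], , drop = FALSE],
    valid = data[shuffled[n_train + seq_len(n_valid)], , drop = FALSE],
    test  = data[shuffled[(n_train + n_valid + 1):n], , drop = FALSE]
  )
}

# Toy data for illustration only.
toy <- data.frame(id = 1:100, is_listened = rep(c(0, 1), 50))
parts <- split_three_way(toy)
```

The model would then be fitted on parts$train, tuned against parts$valid, and evaluated once on parts$test. If generalization over time matters more than generalization over users, splitting on ts_listen instead of shuffling would be an alternative worth considering.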
Training a random forest could be computationally very expensive in this setting because the training set contains a very large number of observations (more than 7.5 million). Other prediction models could be used instead, such as artificial neural networks built with TensorFlow (https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf), which is designed for large-scale machine learning.
This report provides first suggestions for tackling the proposed Kaggle challenge. The main difference between the solutions to questions 1 and 2 lies in how well the predictive model generalizes. To win the competition, we simplify the problem based on the distribution of the testing set. This solution would probably not generalize well if a test set with a different distribution were used. To the best of our knowledge, Kaggle evaluates the performance of the model on the provided testing set only. To generalize the model (question 2), additional steps would be necessary, such as splitting the training set into three smaller sets for training, validation, and testing. Because of the very large number of training observations in this case, predictive models that scale well with the number of observations are needed.
In our opinion, the two solutions would not overlap in our case because:
- We adapt the model in question 1 for solving the proposed problem focusing on the distribution of the testing set, which allows some simplifications.
- We adapt the model in question 2 so that it generalizes. More effort is needed in this case.
Further investigation of the data could be carried out, for example identifying correlations between the variables, if any exist. Other data manipulations could also be considered, especially for question 1, for example down-weighting observations in the training set instead of removing them.
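As a rough illustration of the weighting idea, the sketch below assigns a smaller weight to rows with listen_type = 0 instead of dropping them. The function name listen_type_weights and the down-weighting factor of 0.2 are arbitrary assumptions chosen for the example; a suitable value would have to be tuned on validation data.

```r
# Hypothetical helper: weight 1 for listen_type == 1 rows (the kind that
# dominates the test set) and a reduced weight for all other rows.
listen_type_weights <- function(train, down_weight = 0.2) {
  stopifnot(down_weight >= 0, down_weight <= 1)
  ifelse(train$listen_type == 1, 1, down_weight)
}

# Toy data for illustration only.
toy <- data.frame(
  listen_type = c(1, 0, 1, 0),
  is_listened = c(1, 0, 0, 1)
)
w <- listen_type_weights(toy)
```

The resulting vector can be passed to fitting functions that accept observation weights, for instance through the weights argument of glm in base R.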