DATA 607 Final Project: Prediction Systems

2017-12-10

Introduction

Using data from kaggle.com and participating in a current competition to create a better music recommendation system.
The goal is to predict the chances of a user listening to a song within a month of the listener's first observable listening event.
Through kaggle KKBOX, Asia's leading music streaming service, a training data set with information about songs and listeners is provided.
Target = 1 listener who listened to a song within a month of the first observed listening event
Target = 0 listener who does not listen to the song again within a month of first hearing it

Loading Tables

Load the Members table directly from MySQL database
Load the following csv tables
- Songs
- Extra song information
- Training data set: Broken into components
  - to build the prediction model
  - test it against known target values
- Testing data set

Merge Training Sets

merger1 <- merge(trainingdata,songs,"song_id", all.x = TRUE)  
#all.x=TRUE gives left outer join
merger_train <- merge(merger1,members,"user_id", all.x = TRUE)

merger2 <- merge(testwithtargetknown,songs, by = "song_id", all.x=TRUE)  
merger_train2 <- merge(merger2,members, by = "user_id", all.x=TRUE)

Bar Graph to Show The Number of Listeners By Target Value

ggplot(merger_train,aes(x= target))+ theme_bw(base_size = 16) + theme(axis.text.x=element_text(angle=90,hjust=1)) + geom_bar(colour = "purple")

Considering The Proportion of Users Who Listen to a Song More Than Once Within One Month of First Hearing It - Target = 1

prop.table(table(merger_train$target))

## 
##         0         1 
## 0.2126667 0.7873333

This shows that 78.73% of listeners listened to songs within a month of first hearing it. This means we can assume that most listeners will have a target of 1.

Beginning To Build The Model By Setting the Target for Every Listener in the Test Group Equal to 1

merger_test$target <- rep(1,4000)
merger_train2$targetguess <- rep(1, 2000)

Gender - target model

prop.table(table(merger_train$gender, merger_train$target),1)

##         
##                  0         1
##          0.2026049 0.7973951
##   female 0.1411765 0.8588235
##   male   0.2740899 0.7259101

86% of women who listened to a song, listened to it a second time.
73% of men who listened to a song, listened to it a second time.

System Tab - Target Model

Bar Plot to View Differences for Listening Preferences According to System Tab

ggplot(merger_train,aes(x= system_tab))+ theme_bw(base_size = 16) + theme(axis.text.x=element_text(angle=90,hjust=1)) + geom_bar(colour = "red")

It is clear that people will prefer to listen songs from their own library and have a greater probability to listen to those songs again.

System Tab

prop.table(table(merger_train$system_tab, merger_train$target),1)

##               
##                        0         1
##                0.2142857 0.7857143
##   discover     0.3755061 0.6244939
##   explore      0.4081633 0.5918367
##   listen with  0.6304348 0.3695652
##   my library   0.1398678 0.8601322
##   notification 0.0000000 1.0000000
##   radio        0.7580645 0.2419355
##   search       0.4947917 0.5052083

The system tab is the name of the tab where the event was triggered
76% of people whose system tab is radio do not listen to a song again within a month of first hearing it
63.04% of the people whose system tab is listen withdo not listen to a song again within a month of first hearing it

Gender and System Tab - Target Model

aggregate(target ~ system_tab + gender, data=merger_train, FUN=function(x) {sum(x)/length(x)})

##      system_tab gender     target
## 1                      1.00000000
## 2      discover        0.69343066
## 3       explore        0.33333333
## 4   listen with        0.50000000
## 5    my library        0.88493724
## 6  notification        1.00000000
## 7         radio        0.25000000
## 8        search        0.37735849
## 9      discover female 0.77319588
## 10  listen with female 0.00000000
## 11   my library female 0.89340102
## 12        radio female 1.00000000
## 13       search female 0.64285714
## 14     discover   male 0.42105263
## 15      explore   male 0.50000000
## 16  listen with   male 0.18181818
## 17   my library   male 0.88472622
## 18        radio   male 0.19047619
## 19       search   male 0.03448276

There are a number of helpful predictions that can be make based on this data. 100% of females and 19% of males with a system tab of radio had a target of 1. No females and 18% of males with a system tab of 'listen with' had a target of 1, but 50% of people whose gender was not listed had a target of 1 with that system tab. 64% of females and 3% of males with the search system tab had a target of 1.

Comparing the Target to Our Guess of the Target

(sum(merger_train2$target == merger_train2$targetguess))/2000

## [1] 0.8005

The target is correctly predicted 80.05% of the time.

Decision Tree

fit <- rpart(target ~  gender + system_tab + entry_source, data=merger_train, method="class")
fancyRpartPlot(fit)

The decision tree separates by entry source and then by gender.

Using the Decision Tree to Predict the Target Value

Prediction <- predict(fit, merger_train2, type = "class")
merger_train2$predict <- Prediction
(sum(merger_train2$target == merger_train2$predict))/2000

## [1] 0.804

Using the decision tree, the target was predicted 80.40% of the time.

Recommender Lab model

Second model to get the desired sample_submission file for our development of a shiny app is made with 'Recommender lab' model

As working with song_id was a tedious process,We gave sequential numbers "Row_id" to our train set for better understanding. [Song_id's are bit lengthy here]

Heatmap for raw-targets and Normalized targets

image(r,main = "Raw Targets")

image(r_m,main = "Normalized Targets")

Turning matrix into binary model

r_b <- binarize(r,minRating = 1)
head(as(r_b,"matrix"))

##   Row_id     0     1
## 1   TRUE FALSE  TRUE
## 2   TRUE  TRUE FALSE
## 3   TRUE  TRUE FALSE
## 4   TRUE  TRUE FALSE
## 5   TRUE  TRUE FALSE
## 6   TRUE  TRUE FALSE

We used "UBCF" (User-based collaborative filtering) for the predictions.

rec <- Recommender(r[1:nrow(r)],method="UBCF", param=list(normalize = "Z-score",method="Cosine",nn=25))
recom <- predict(rec,r[1:nrow(r)],type = "topNList",n = 10)
Row_id <- testdata$row_id 
rec_list <- as(recom,"list")
submission_file <- data.frame(matrix(unlist(rec_list[1:4000])))
colnames(submission_file) <- "Recommended_target"
submission_file1 <- as.data.frame(cbind(Row_id,submission_file))
head(submission_file1)

##   Row_id Recommended_target
## 1      0                  0
## 2      1                  1
## 3      2                  1
## 4      3                  1
## 5      4                  1
## 6      5                  1

write.table(submission_file1,file = "submission.csv",row.names=FALSE,col.names=TRUE,sep=',')

Final submission file gives a very clear picture of row_ids and their required targets, which was the official requirement of project!!

Further development to help users with our results

In our attempt to build an application through shiny and help users with our result ,a small attempt is made to develop the application.

Challenges

Using Very Large Data Sets
Missing Data
We have different versions of R so some of our code doesn't run on both of our computers
Wanting to Create a Model By That Includes Song Or Listener, but Having Too Many Different Options

Conclusions

Entry Source, Gender and System Tab were the most useful predictors for determining whether a listener will listen to a song again within a month of first hearing it
The Decision Tree is easy to create in R and very useful for creating a predictive model
Working with a data set through kaggle was a positive experience because there were so many resources on the site itself about recommender systems