DATA 607 Final Project: Prediction Systems

Introduction

In this project, we will be using data from kaggle.com and participating in a current competition to create a better music recommendation system. We will predict the chances of a user listening to a song repetitively after the first observable listening event within one month of first hearing the song. If there are recurring listening event(s) triggered within a month after the user’s very first observable listening event, its target is marked 1, and otherwise it is marked 0 in the training set.

KKBOX provides a training data set consisting of information of the first observable listening event for each unique user-song pair within a specific time duration. Metadata of each unique user and song pair is also provided. The use of public data to increase the level of accuracy of the prediction is encouraged.

The training and the test data are selected from users’ listening history in a given time period. Note that this time period is chosen to be before the WSDM-KKBox Churn Prediction time period. The training and test sets are split based on time, and the split of public/private are based on unique user/song pairs.

The link to the dataset : https://www.kaggle.com/c/kkbox-music-recommendation-challenge/data

Our Final target is to build the csv file with Row_Id for the songs and its target value.
Target = 1 refers to a listener who listened to a song within a month of the first observed listening event

Target = 0 refers to a listener who does not listen to the song again within a month of first hearing it

Loading Libraries

library(ggplot2)
library(dplyr)
library(RMySQL)
library(readr)
library(reshape2)
library(tidyr)
library(stringr)
library(rpart)
library(RColorBrewer)
library(rpart.plot)
library(rattle)
library(recommenderlab)

Loading Tables

Load the Members table directly from MySQL database
Load the following csv tables
- Songs
- Extra song information
- Training data set: Broken into components
  - to build the prediction model
  - test it against known target values
- Testing data set

Step I : Data Cleansing and Operations

Loading Members Table from MySQL

#Read password for mysql from file on computer
sqlpass <- read_file("C:/Users/Swigo/Desktop/Sarah/sqlpassword.txt") 

members_db <- dbConnect(RMySQL::MySQL(), user='root', password=sqlpass, dbname='songs', host='localhost', port=3306) #accessing mysql database

membertable <- dbSendQuery(members_db, "SELECT * FROM members;")  
members <- fetch(membertable, n=-1) #n=-1 retrieves all pending records
dbDisconnect(members_db)

## [1] TRUE

members <- members[1:10000,]

members$age <- gsub(0, NA, members$age) #replace age that is recorded as zero with NA
head(members)

##                                        user_id city  age gender
## 1 ++5wYjoMgQHoRuD3GbbvmphZbBBwymzv5Q4l8sywtuU=   13   39   male
## 2 ++AH7m/EQ4iKe6wSlfO/xXAJx50p+fCeTyF90GoE9Pg=    9   21 female
## 3 ++e+jsxuQ8UEnmW40od9Rq3rW7+wAum4wooXyZTKJpk=   14   31   male
## 4 ++TMf3Z81LFn6nXHm753GvkF2DhAgQwRV5zn0yQsX40=    1 <NA>       
## 5 ++xWL5Pbi2CpG4uUugigQahauM0J/sBIRloTNPBybIU=   17   18   male
## 6 +/SKX44s4ryWQzYzuV7ZKMXqIKQMN1cPz3M8CJ8CFKU=    1 <NA>       
##   registration_method registration_time expiration_date
## 1                   9          20060218        20171021
## 2                   4          20151120        20170101
## 3                   7          20140801        20170505
## 4                   4          20170219        20170222
## 5                   3          20121209        20170129
## 6                   9          20111218        20170907

Loading Extra Song Info

song_extra_info <- read.csv("https://raw.githubusercontent.com/swigodsky/Music-Recommender-System/master/song_extra_infob.csv",header=TRUE, stringsAsFactors=FALSE)
song_extra_info <- song_extra_info[1:4000,]

Loading Song Data Table

songs <- read.csv("https://raw.githubusercontent.com/swigodsky/Music-Recommender-System/master/songsd.csv", header=FALSE, stringsAsFactors = FALSE)

names(songs) = as.character(unlist(songs[1,]))
songs <- songs[-1,]
songs<-songs[1:10000,]
colnames(songs) <- c("song_id", "song_length", "genre_id", "artist_name", "composer", "lyricist", "language")
head(songs)

##                                        song_id song_length        genre_id
## 2 CXoTN1eb7AI+DntdU1vbcwGRV4SCIDxZu+YD8JP8r4E=      247640             465
## 3 o0kFgae9QtnYgRkVPqLJwa05zIhRlUjfF7O1tDw0ZDU=      197328             444
## 4 DwVvVurfpuz+XPuFvucclVQEyPqcpUkHR0ne1RQzPs0=      231781             465
## 5 dKMBWoZyScdxSkihKG+Vf47nc18N9q4m58+b4e7dSSE=      273554             465
## 6 W3bqWd3T+VeHFzHAUfARgW9AvVRaF4N5Yzm4Mr6Eo/o=      140329             726
## 7 kKJ2JNU5h8rphyW21ovC+RZU+yEHPM+3w85J37p7vEQ=      235520 864|857|850|843
##              artist_name                           composer       lyricist
## 2 å¼µä¿¡å<U+0093>² (Jeff Chang)                             è<U+0091>£è²<U+009E>      ä½<U+0095>å<U+0095><U+009F>å¼<U+0098>
## 3              BLACKPINK TEDDY|  FUTURE BOUNCE|  Bekuh BOOM          TEDDY
## 4           SUPER JUNIOR                                                  
## 5                  S.H.E                          æ¹¯å°åº·      å¾ä¸<U+0096>ç
## 6           è²´æ<U+0097>ç²¾é¸                        Traditional    Traditional
## 7           è²´æ<U+0097>ç²¾é¸                       Joe Hisaishi Hayao Miyazaki
##   language
## 2        3
## 3       31
## 4       31
## 5        3
## 6       52
## 7       17

Loading Test Data - This Is Used For Submission To Kaggle

testdata <- read.csv("https://raw.githubusercontent.com/swigodsky/Music-Recommender-System/master/test.csv", stringsAsFactors = FALSE)
testdata <-testdata[1:4000,]
colnames(testdata) <- c("row_id", "user_id", "song_id", "system_tab", "layout_seen", "entry_source")

Loading Training Data - This is Used To Build and Test The Prediction Model

trainingdata <- read.csv("https://raw.githubusercontent.com/swigodsky/Music-Recommender-System/master/trainb.csv", stringsAsFactors = FALSE)

testwithtargetknown <-trainingdata[6001:8000,] #this data frame will be used to test our model
colnames(testwithtargetknown) <- c("user_id", "song_id", "system_tab", "layout_seen", "entry_source", "target")

trainingdata <-trainingdata[1:6000,]
colnames(trainingdata) <- c("user_id", "song_id", "system_tab", "layout_seen", "entry_source", "target")
head(trainingdata)

##                                        user_id
## 1 FGtllVqz18RPiwJj/edr2gV78zirAiY/9SmYvia+kCg=
## 2 Xumu+NIjS6QYVxDS4/t3SawvJ7viT9hPKXmf0RtLNx8=
## 3 Xumu+NIjS6QYVxDS4/t3SawvJ7viT9hPKXmf0RtLNx8=
## 4 Xumu+NIjS6QYVxDS4/t3SawvJ7viT9hPKXmf0RtLNx8=
## 5 FGtllVqz18RPiwJj/edr2gV78zirAiY/9SmYvia+kCg=
## 6 FGtllVqz18RPiwJj/edr2gV78zirAiY/9SmYvia+kCg=
##                                        song_id system_tab
## 1 BBzumQNXUHKdEBOB7mAJuzok+IJA1c2Ryg/yzTF6tik=    explore
## 2 bhp/MpSNoqoxOIB+/l8WPqu6jldth4DIpCm3ayXnJqM= my library
## 3 JNWfrrC7zNN7BdMpsISKa4Mw+xVJYNnxXh3/Epw7QgY= my library
## 4 2A87tzfnJTSWqD7gIZHisolhe4DMdzkbd6LzO1KHjNs= my library
## 5 3qm6XTZ6MOCU11x8FIVbAGH5l5uMkT3/ZalWG1oo2Gc=    explore
## 6 3Hg5kugV1S0wzEVLAEfqjIV5UHzb7bCrdBRQlGygLvU=    explore
##           layout_seen    entry_source target
## 1             Explore online-playlist      1
## 2 Local playlist more  local-playlist      1
## 3 Local playlist more  local-playlist      1
## 4 Local playlist more  local-playlist      1
## 5             Explore online-playlist      1
## 6             Explore online-playlist      1

Step II : Merging Operations for model building

Merge Training Sets

merger1 <- merge(trainingdata,songs,"song_id", all.x = TRUE)  
#all.x=TRUE gives left outer join
merger_train <- merge(merger1,members,"user_id", all.x = TRUE)

merger2 <- merge(testwithtargetknown,songs, by = "song_id", all.x=TRUE)  
merger_train2 <- merge(merger2,members, by = "user_id", all.x=TRUE)

View Merged Training Data Set

knitr::kable(head(merger_train))

user_id	song_id	system_tab	layout_seen	entry_source	target	song_length	genre_id	artist_name	composer	lyricist	language	city	age	registration_method	registration_time	expiration_date
/1SwgVHh8c46pmuq1WlHnRz1maIqTYncZZmihvFJZtk=	bYqstmYZYnvJ5kc92UFbTw7OE8+Qih3p6qgm+vjpzwo=	my library	Local playlist more	local-library	1	303960	958	äº¤é¿æäººå¤¢			-1	1	NA	7	20131225	20170916
/1SwgVHh8c46pmuq1WlHnRz1maIqTYncZZmihvFJZtk=	BAcidAOnsF/W93uUvNBBYOSXevkVhOVbpat2+mhz520=	search	Artist more	top-hits-for-artist	0	NA	NA	NA	NA	NA	NA	1	NA	7	20131225	20170916
/1SwgVHh8c46pmuq1WlHnRz1maIqTYncZZmihvFJZtk=	vll4VeRJoQA78n2KNpZXJXDNoJTVSD1iijuoNaxE1+c=	search	Artist more	top-hits-for-artist	0	NA	NA	NA	NA	NA	NA	1	NA	7	20131225	20170916
/1SwgVHh8c46pmuq1WlHnRz1maIqTYncZZmihvFJZtk=	LZC3fGHeMT7iQ5gjlHefXMFk3dQW9L8+1rd8epi0PwM=	search	Artist more	top-hits-for-artist	0	NA	NA	NA	NA	NA	NA	1	NA	7	20131225	20170916
/1SwgVHh8c46pmuq1WlHnRz1maIqTYncZZmihvFJZtk=	+A07bYnnS0PsjaN9ZeRMMpmJaoTUOA2a2A8OEK8luOk=	search	Artist more	top-hits-for-artist	0	NA	NA	NA	NA	NA	NA	1	NA	7	20131225	20170916
/1SwgVHh8c46pmuq1WlHnRz1maIqTYncZZmihvFJZtk=	6haAXG7SXbxo+kq1W9KQQhqdKZ2UdhGBkQ2vKBkuMxY=	search	Artist more	top-hits-for-artist	0	NA	NA	NA	NA	NA	NA	1	NA	7	20131225	20170916

Merge Test sets

merger3 <- merge.data.frame(testdata,songs,by = "song_id", all.x = TRUE)
merger_test <- merge.data.frame(merger3,members,by = "user_id", all.x = TRUE)

The following chunk is written to get top 10 song_id listened by users in descending orders.

user_based <- data.frame(table(merger_train2$user_id))
song_based <- data.frame(table(merger2$song_id)) 
colnames(user_based) <- c("user_id","Frequency")
colnames(song_based) <- c("song_id","Frequency1")

user_based <- arrange(user_based,desc(Frequency))
user_based <- as.data.frame(user_based)
user_based$user_id <- as.character(user_based$user_id)


song_based <- arrange(song_based,desc(Frequency1))
song_based <- as.data.frame(song_based)

top_n(song_based,10,Frequency1)

##                                         song_id Frequency1
## 1  PgRtmmESVNtWjoZHO5a1r21vIz9sVZmcJJpFCbRa1LI=         14
## 2  43Qm2YzsP99P5wm37B1JIhezUcQ/1CDjYlQx6rBbz2U=         11
## 3  J4qKkLIoW7aYACuTupHLAPZYmRp08en1AEux+GSUzdw=         10
## 4  FynUyq0+drmIARmK1JZ/qcjNZ7DKkqTY6/0O0lTzNUI=          7
## 5  IKMFuL0f5Y8c63Hg9BXkeNJjE0z8yf3gMt/tOxF4QNE=          7
## 6  T86YHdD4C9JSc274b1IlMkLuNdz4BQRB50fWWE7hx9g=          7
## 7  +LztcJcPEEwsikk6+K5udm06XJQMzR4+lzavKLUyE0k=          6
## 8  3DU6F6k6dFSdoQa2tsia5spMXFCxWh4JJdA3OxSo3rM=          6
## 9  Ce8Mui+yVruumWy9pKMDfSRBFg1m93m/+XR8FHlktaE=          6
## 10 DhuNF1B0GK4zzZ7U+v/vt/Y+3aQbTRmV2bd99TUAXzI=          6
## 11 izWkEacMHX6lc+hEC3QnWwWd2LXiHZ9tacUNpfUz+aw=          6
## 12 L6w2d0w84FjTvFr+BhMfgu7dZAsGiOqUGmvvxIG3gvQ=          6
## 13 TbrfWJpl81E4alsi2+b84oeUu+b5S/8n9yFlZDriNYs=          6

Step III : Developement of model

Bar Graph to Show The Number of Listeners By Target Value

ggplot(merger_train,aes(x= target))+ theme_bw(base_size = 16) + theme(axis.text.x=element_text(angle=90,hjust=1)) + geom_bar(color = "red")

Considering The Proportion of Users Who Listen to a Song More Than Once Within One Month of First Hearing It - Target = 1

prop.table(table(merger_train$target))

## 
##         0         1 
## 0.2126667 0.7873333

Interesting!! This shows that 78.73% of listeners listened to songs within a month of first hearing it. This means we can assume that most listeners will have a target of 1.

Beginning To Build The Model By Setting the Target for Every Listener in the Test Group Equal to 1

merger_test$target <- rep(1,4000)
merger_train2$targetguess <- rep(1, 2000)

Gender - target model

The graph gives a general idea of what gender group is more prevalent in the data set.

ggplot(data=subset(merger_train, !is.na(gender)), aes(x=gender)) + geom_bar(stat="count")

There are more females than males in the data set, and there are a high percentage of listeners whose gender in unknown.

prop.table(table(merger_train$gender, merger_train$target),1)

##         
##                  0         1
##          0.2026049 0.7973951
##   female 0.1411765 0.8588235
##   male   0.2740899 0.7259101

86% of women who listened to a song, listened to it a second time. About 73% of men who listened to a song, listened to it a second time.

System Tab - Target Model

Bar Plot to View Differences for Listening Preferences According to System Tab

ggplot(merger_train,aes(x= system_tab))+ theme_bw(base_size = 16) + theme(axis.text.x=element_text(angle=90,hjust=1)) + geom_bar(colour = "red")

It is clear that people will prefer to listen songs from their own library and have a greater probability to listen to those songs again.

prop.table(table(merger_train$system_tab, merger_train$target),1)

##               
##                        0         1
##                0.2142857 0.7857143
##   discover     0.3755061 0.6244939
##   explore      0.4081633 0.5918367
##   listen with  0.6304348 0.3695652
##   my library   0.1398678 0.8601322
##   notification 0.0000000 1.0000000
##   radio        0.7580645 0.2419355
##   search       0.4947917 0.5052083

The system tab is the name of the tab where the event was triggered. System tabs are used to categorize KKBOX mobile apps functions. The target is zero for 75.80% of people whose system tab is radio and 63.04% of the people whose system tab is listen with. System tab will be helpful for making predictions.

Setting the Target to 0 for system_tab=radio and system_tab=listen with

merger_test$target[merger_test$system_tab=='radio'|merger_test$system_tab=='listen with'] <- 0

merger_train2$targetguess[merger_train2$system_tab=='radio'|merger_train2$system_tab=='listen with'] <- 0

Age - Target Model

merger_train$younger <- 0
merger_train$younger[merger_train$age < 30] <-1
prop.table(table(merger_train$younger, merger_train$target),1)

##    
##             0         1
##   0 0.2098765 0.7901235
##   1 0.2439024 0.7560976

Different age cutoffs were tested, and at every age, the percentage of people who repeatedly listened to a song was approximately the same. 76% of people below and 79% of people above 30 years old listened to a song again within one month of first hearing it.

Taking into account every parameter we decided two develop two models for our final predictions

Gender and System Tab - Target Model

aggregate(target ~ system_tab + gender, data=merger_train, FUN=function(x) {sum(x)/length(x)})

##      system_tab gender     target
## 1                      1.00000000
## 2      discover        0.69343066
## 3       explore        0.33333333
## 4   listen with        0.50000000
## 5    my library        0.88493724
## 6  notification        1.00000000
## 7         radio        0.25000000
## 8        search        0.37735849
## 9      discover female 0.77319588
## 10  listen with female 0.00000000
## 11   my library female 0.89340102
## 12        radio female 1.00000000
## 13       search female 0.64285714
## 14     discover   male 0.42105263
## 15      explore   male 0.50000000
## 16  listen with   male 0.18181818
## 17   my library   male 0.88472622
## 18        radio   male 0.19047619
## 19       search   male 0.03448276

There are a number of helpful predictions that can be make based on this data. 100% of females and 19% of males with a system tab of radio had a target of 1. No females and 18% of males with a system tab of ‘listen with’ had a target of 1, but 50% of people whose gender was not listed had a target of 1 with that system tab. 64% of females and 3% of males with the search system tab had a target of 1.

Adjusting the Prediction to take into account listen with and female, male and search

merger_test$target[merger_test$gender=='female' & merger_test$system_tab=='radio'] <- 1
merger_test$target[merger_test$gender=='male'& merger_test$system_tab=='search'] <- 0

merger_train2$targetguess[merger_train2$gender=='female' & merger_train2$system_tab=='radio'] <- 1
merger_train2$targetguess[merger_train2$gender=='male'& merger_train2$system_tab=='search'] <- 0

Comparing the Target to Our Guess of the Target

(sum(merger_train2$target == merger_train2$targetguess))/2000

## [1] 0.8005

The target is correctly predicted 80.05% of the time.

Decision Tree

fit <- rpart(target ~  gender + system_tab + entry_source, data=merger_train, method="class")
fancyRpartPlot(fit)

The decision tree separates by entry source and then by gender.

Using the Decision Tree to Predict the Target Value

Prediction <- predict(fit, merger_train2, type = "class")
merger_train2$predict <- Prediction
(sum(merger_train2$target == merger_train2$predict))/2000

## [1] 0.804

Using the decision tree, the target was predicted 80.40% of the time.

Recommender Lab model

Second model to get the desired sample_submission file for our development of shiny app is made with ‘Recommender lab’ model

As working with song_id was a tedious process,We gave sequential numbers “Row_id” to our train set for better understanding. [Song_id’s are bit lengthy here]

## Formal class 'realRatingMatrix' [package "recommenderlab"] with 2 slots
##   ..@ data     :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. ..@ i       : int [1:12000] 0 1 2 3 4 5 6 7 8 9 ...
##   .. .. ..@ p       : int [1:4] 0 6000 7276 12000
##   .. .. ..@ Dim     : int [1:2] 6000 3
##   .. .. ..@ Dimnames:List of 2
##   .. .. .. ..$ : NULL
##   .. .. .. ..$ : chr [1:3] "Row_id" "0" "1"
##   .. .. ..@ x       : num [1:12000] 0 0 0 0 0 0 0 0 0 0 ...
##   .. .. ..@ factors : list()
##   ..@ normalize:List of 1
##   .. ..$ row:List of 2
##   .. .. ..$ method : chr "center"
##   .. .. ..$ factors:List of 2
##   .. .. .. ..$ means: num [1:6000] 1 2 3 4 5 6 7 8 9 10 ...
##   .. .. .. ..$ sds  : NULL

## [[1]]
## Row_id      1 
##      0      0 
## 
## [[2]]
## Row_id      0 
##      0      0 
## 
## [[3]]
## Row_id      0 
##      0      0 
## 
## [[4]]
## Row_id      0 
##      0      0 
## 
## [[5]]
## Row_id      0 
##      0      0 
## 
## [[6]]
## Row_id      0 
##      0      0

Heatmap for raw-targets and Normalized targets

image(r,main = "Raw Targets")

image(r_m,main = "Normalized Targets")

Turning matrix into binary model

r_b <- binarize(r,minRating = 1)
head(as(r_b,"matrix"))

##   Row_id     0     1
## 1   TRUE FALSE  TRUE
## 2   TRUE  TRUE FALSE
## 3   TRUE  TRUE FALSE
## 4   TRUE  TRUE FALSE
## 5   TRUE  TRUE FALSE
## 6   TRUE  TRUE FALSE

We used “UBCF” (User-based collaborative filtering) for the predictions.

rec <- Recommender(r[1:nrow(r)],method="UBCF", param=list(normalize = "Z-score",method="Cosine",nn=25))
recom <- predict(rec,r[1:nrow(r)],type = "topNList",n = 10)
Row_id <- testdata$row_id 
rec_list <- as(recom,"list")
submission_file <- data.frame(matrix(unlist(rec_list[1:4000])))
colnames(submission_file) <- "Recommended_target"
submission_file1 <- as.data.frame(cbind(Row_id,submission_file))
head(submission_file1)

##   Row_id Recommended_target
## 1      0                  0
## 2      1                  1
## 3      2                  1
## 4      3                  1
## 5      4                  1
## 6      5                  1

write.table(submission_file1,file = "submission.csv",row.names=FALSE,col.names=TRUE,sep=',')

Final submission file gives a very clear picture of row_ids and their required targets, which was the official requirement of project!!

Further development to help users with our results

In our attempt to build an application through shiny and help users with our result ,a small attempt is made to develop the application.

Challenges

Using Very Large Data Sets
Missing Data
We have different versions of R so some of our code doesn’t run on both of our computers
Wanting to Sort By Song Or Listener, but Having Too Many Different Options

Conclusions

Entry Source and System Tab were the most useful predictors for determining whether a listener will listen to a song again within a month of first hearing it
The Decision Tree is easy To create in R and very useful for creating a predictive model
Working with a data set through kaggle was a positive experience because there were so many resources on the site itself about recommender systems