
2017-12-10
# all.x = TRUE gives a left outer join
merger1 <- merge(trainingdata, songs, by = "song_id", all.x = TRUE)
merger_train <- merge(merger1, members, by = "user_id", all.x = TRUE)
merger2 <- merge(testwithtargetknown, songs, by = "song_id", all.x = TRUE)
merger_train2 <- merge(merger2, members, by = "user_id", all.x = TRUE)
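As a small illustration of what `all.x = TRUE` does in the merges above, the sketch below joins two toy data frames. The frames `plays` and `songs_toy` are made up for this example and are not part of the project data.

```r
# Toy data; these frames are illustrative, not the project's data.
plays <- data.frame(song_id = c("a", "b", "c"), target = c(1, 0, 1))
songs_toy <- data.frame(song_id = c("a", "b"), genre = c("pop", "rock"))

# all.x = TRUE keeps every row of the left frame (left outer join);
# song "c" has no match, so its genre becomes NA.
left_joined <- merge(plays, songs_toy, by = "song_id", all.x = TRUE)

# Without all.x, merge() performs an inner join and drops song "c".
inner_joined <- merge(plays, songs_toy, by = "song_id")

nrow(left_joined)              # 3
nrow(inner_joined)             # 2
sum(is.na(left_joined$genre))  # 1
```

A left join keeps the row count of the training data stable, at the cost of NA values wherever a song or member has no metadata.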
ggplot(merger_train,aes(x= target))+ theme_bw(base_size = 16) + theme(axis.text.x=element_text(angle=90,hjust=1)) + geom_bar(colour = "purple")
prop.table(table(merger_train$target))
##
##         0         1
## 0.2126667 0.7873333
This shows that in 78.73% of the training records the listener played the song again within a month of first hearing it (target = 1). Guessing a target of 1 for every record is therefore a reasonable baseline.
merger_test$target <- rep(1, 4000)
merger_train2$targetguess <- rep(1, 2000)
prop.table(table(merger_train$gender, merger_train$target),1)
##
##                  0         1
##          0.2026049 0.7973951
##   female 0.1411765 0.8588235
##   male   0.2740899 0.7259101
ggplot(merger_train,aes(x= system_tab))+ theme_bw(base_size = 16) + theme(axis.text.x=element_text(angle=90,hjust=1)) + geom_bar(colour = "red")
People clearly prefer to listen to songs from their own library, and they are more likely to listen to those songs again.
prop.table(table(merger_train$system_tab, merger_train$target),1)
##
##                        0         1
##                0.2142857 0.7857143
##   discover     0.3755061 0.6244939
##   explore      0.4081633 0.5918367
##   listen with  0.6304348 0.3695652
##   my library   0.1398678 0.8601322
##   notification 0.0000000 1.0000000
##   radio        0.7580645 0.2419355
##   search       0.4947917 0.5052083
aggregate(target ~ system_tab + gender, data=merger_train, FUN=function(x) {sum(x)/length(x)})
##      system_tab gender     target
## 1                      1.00000000
## 2      discover        0.69343066
## 3       explore        0.33333333
## 4   listen with        0.50000000
## 5    my library        0.88493724
## 6  notification        1.00000000
## 7         radio        0.25000000
## 8        search        0.37735849
## 9      discover female 0.77319588
## 10  listen with female 0.00000000
## 11   my library female 0.89340102
## 12        radio female 1.00000000
## 13       search female 0.64285714
## 14     discover   male 0.42105263
## 15      explore   male 0.50000000
## 16  listen with   male 0.18181818
## 17   my library   male 0.88472622
## 18        radio   male 0.19047619
## 19       search   male 0.03448276
Several helpful predictions can be made based on this data. 100% of females and 19% of males with a system tab of radio had a target of 1. No females and 18% of males with a system tab of 'listen with' had a target of 1, but 50% of people whose gender was not listed had a target of 1 with that system tab. 64% of females and 3% of males with the search system tab had a target of 1.
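Extreme rates such as 100% or 0% can come from groups with very few rows; the output above does not show group sizes, so this is a caution rather than a finding. A minimal sketch, on made-up toy data, of reporting the group size `n` next to each rate:

```r
# Toy data for illustration only; not the project's merger_train.
toy <- data.frame(
  system_tab = c("radio", "radio", "search", "search", "search", "radio"),
  gender     = c("female", "male", "male", "male", "female", "male"),
  target     = c(1, 0, 0, 1, 1, 0)
)

# Share of target == 1 per group (same idea as sum(x)/length(x)).
rates <- aggregate(target ~ system_tab + gender, data = toy, FUN = mean)

# Number of rows per group.
counts <- aggregate(target ~ system_tab + gender, data = toy, FUN = length)
names(counts)[3] <- "n"

# One table with both the rate and the number of rows behind it.
summary_tab <- merge(rates, counts, by = c("system_tab", "gender"))
summary_tab
```

In this toy table the radio/female group has a rate of 1 but only one row, which is the kind of case worth checking before relying on a rate for prediction.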
(sum(merger_train2$target == merger_train2$targetguess))/2000
## [1] 0.8005
Guessing a target of 1 for every record is correct for 80.05% of the 2,000 records with known targets.
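The same all-ones check can be written as a small reusable helper. This is only a sketch: `majority_baseline` and the toy vectors are hypothetical names introduced here, not part of the original analysis.

```r
# Hypothetical helper: accuracy of always predicting the most
# frequent class seen in the training labels.
majority_baseline <- function(train_target, test_target) {
  # Most frequent class in the training labels, as a string
  majority <- names(which.max(table(train_target)))
  # Share of test rows whose label equals that class
  mean(as.character(test_target) == majority)
}

# Toy labels for illustration only.
train_toy <- c(1, 1, 1, 0, 1)
test_toy  <- c(1, 0, 1, 1)
majority_baseline(train_toy, test_toy)  # 0.75
```

Using `mean()` on a logical comparison avoids hard-coding the number of rows, as the division by 2000 does.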
library(rpart)   # rpart()
library(rattle)  # fancyRpartPlot()
fit <- rpart(target ~ gender + system_tab + entry_source, data = merger_train, method = "class")
fancyRpartPlot(fit)
The decision tree separates by entry source and then by gender.
Prediction <- predict(fit, merger_train2, type = "class")
merger_train2$predict <- Prediction
(sum(merger_train2$target == merger_train2$predict))/2000
## [1] 0.804
Using the decision tree, the target was correctly predicted 80.40% of the time.
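The tree's 80.40% is only 0.35 percentage points above the 80.05% all-ones guess, which is about 7 of the 2,000 records. A confusion matrix shows where a classifier gains or loses against such a baseline. The sketch below uses made-up vectors, not the project's predictions.

```r
# Toy labels and predictions for illustration only.
actual    <- c(1, 1, 0, 1, 0, 1)
predicted <- c(1, 1, 1, 1, 0, 0)

# Rows are the true classes, columns the predicted classes.
confusion <- table(actual = actual, predicted = predicted)
confusion

# Accuracy is the share of counts on the diagonal.
accuracy <- sum(diag(confusion)) / sum(confusion)
accuracy
```

With the real data, the equivalent call would presumably be `table(merger_train2$target, merger_train2$predict)`; the off-diagonal cells then show how many 0s the tree mislabels as 1s and vice versa.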
The second model, which produces the submission file needed for our Shiny app, is built with the recommenderlab package.
Because working with song_id was tedious (the song IDs are quite long), we gave our train set sequential "Row_id" numbers to make it easier to follow.
image(r,main = "Raw Targets")
image(r_m,main = "Normalized Targets")
r_b <- binarize(r, minRating = 1)
head(as(r_b, "matrix"))
##   Row_id     0     1
## 1   TRUE FALSE  TRUE
## 2   TRUE  TRUE FALSE
## 3   TRUE  TRUE FALSE
## 4   TRUE  TRUE FALSE
## 5   TRUE  TRUE FALSE
## 6   TRUE  TRUE FALSE
rec <- Recommender(r[1:nrow(r)], method = "UBCF", param = list(normalize = "Z-score", method = "Cosine", nn = 25))
recom <- predict(rec, r[1:nrow(r)], type = "topNList", n = 10)
Row_id <- testdata$row_id
rec_list <- as(recom, "list")
submission_file <- data.frame(matrix(unlist(rec_list[1:4000])))
colnames(submission_file) <- "Recommended_target"
submission_file1 <- as.data.frame(cbind(Row_id, submission_file))
head(submission_file1)
##   Row_id Recommended_target
## 1      0                  0
## 2      1                  1
## 3      2                  1
## 4      3                  1
## 5      4                  1
## 6      5                  1
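The recommender above uses user-based collaborative filtering (UBCF) with cosine similarity. As a rough illustration of the idea only, and not recommenderlab's internal code, the sketch below computes cosine similarity between toy users in base R. The `ratings` matrix and `cosine_sim` helper are made up here, and treating missing ratings as 0 is a simplifying assumption.

```r
# Toy user-by-item matrix for illustration only (rows are users).
ratings <- rbind(
  u1 = c(1, 0, 1, 1),
  u2 = c(1, 0, 1, 0),
  u3 = c(0, 1, 0, 0)
)

# Hypothetical helper: cosine of the angle between two rating vectors.
cosine_sim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Similarity of every user to u1.
sims <- apply(ratings, 1, cosine_sim, b = ratings["u1", ])

# Most similar other user; UBCF would pool ratings from the nn closest.
others <- sims[names(sims) != "u1"]
nearest <- names(which.max(others))
nearest  # "u2"
```

In the real model, `nn = 25` asks for the 25 most similar users rather than the single nearest one shown here.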
write.table(submission_file1,file = "submission.csv",row.names=FALSE,col.names=TRUE,sep=',')
The final submission file pairs each row_id with its predicted target, which was the official requirement of the project.
To make our results available to users, we also made a small first attempt at building an application with Shiny.