Music does more than affect your mood: listening to particularly happy or sad music can even change the way we perceive the world, according to researchers from the University of Groningen. These days we can easily choose whatever music we want to listen to. Platforms such as Spotify are known for their music recommender systems, which suggest music based on each customer's listening history or genre preferences. It would be a new idea if music could also be enjoyed through its lyrics, with recommendations based on the mood those lyrics convey.

Background

Objective

This project is based on this kaggle dataset. The dataset contains 250k lyrics with valence values gathered using the Spotify API. Valence is a measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). Our task in this article is to perform supervised NLP sentiment analysis to measure the positiveness of a song. Spotify itself could use this kind of analysis to improve its music recommendations based on lyrics (words).

Limitation: language is wide and complex, and NLP is known for its high computational cost. So in this analysis I will only use English song lyrics and sample the data down to 45k songs.

Let’s begin

Data Import

## Observations: 158,353
## Variables: 5
## $ X      <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ artist <fct> Elijah Blake, Elijah Blake, Elijah Blake, Elijah Blake, Elij...
## $ seq    <fct> "No, no\nI ain't ever trapped out the bando\nBut oh Lord, do...
## $ song   <fct> Everyday, Live Till We Die, The Otherside, Pinot, Shadows & ...
## $ label  <dbl> 0.6260, 0.6300, 0.2400, 0.5360, 0.3710, 0.3210, 0.6010, 0.33...

Data Wrangling

Feature Engineering

## [1] No, no\nI ain't ever trapped out the bando\nBut oh Lord, don't get me wrong\nI know a couple niggas that do\nI'm from a place where everybody knows your name\nThey say I gotta watch my attitude\nWhen they see money, man they all start actin' strange\nSo fuck with the ones that fuck with you\nThey can never say I'm brand new\n\nIt's everyday, everyday\nEveryday, everyday, everyday\nEveryday, everyday\nEveryday, everyday\nI've been talkin' my shit, nigga that's regular\nI've been lovin' 'em thick, life is spectacular\nI spend like I'ma die rich, nigga I'm flexin', yeah\nEveryday, that's everyday\nThat's everyday\nThat's everyday\nThat's everyday, everyday\n\nI see all of these wanna-be hot R&B singers\nI swear you all sound the same\nThey start from the bottom, so far from the motto\nYou niggas'll never be Drake\nShout out to OVO\nMost of them prolly don't know me though\nI stay in the cut, I don't fuck with no\nBody but I D, that's a pun on No I.D\nWhen nobody know my name\nRunnin' for my dream wasn't hard to do\nYou break bread, I swear they all pull out a plate\nEat with the ones who starved with you\nIf I'm winnin' then my crew can't lose\n\nIt's everyday, everyday\nEveryday, everyday, everyday\nEveryday, everyday\nEveryday, everyday\nI've been talkin' my shit, nigga that's regular\nI've been lovin' 'em thick, life is spectacular\nI spend like I'ma die rich, nigga I'm flexin', yeah\nEveryday, that's everyday\nThat's everyday\nThat's everyday\nThat's everyday, everyday\n\nI heard since you got money\nYou changed, you're actin' funny\nThat's why I gets on my lonely\nYou be lovin' when change is a hobby\nWho do you dress when you ain't got nobody?\n\nIt's everyday, everyday\nEveryday, everyday, everyday\nEveryday, everyday\nEveryday, everyday\nI've been talkin' my shit, nigga that's regular\nI've been lovin' 'em thick, life is spectacular\nI spend like I'ma die rich, nigga I'm flexin', yeah\nEveryday, that's everyday\nThat's everyday\nThat's everyday\nThat's everyday, everyday
## 135645 Levels: ''Do you want... to have... a tasty... mushroom?' ...

The lyrics are stored in the seq column. As you can see, they will need a lot of treatment before modeling. The simplest thing we can do first is to remove "\n", the newline break. The target column (label) is still in numeric format. As I said before, a higher valence (label) value means the song is considered positive in mood and a lower valence means negative. I'll convert the valence value to a binary label, positive or negative, with 0.5 as the cutoff. I also want to filter for English-only lyrics to make the NLP easier. I'll use a function from the cld2 package to detect the lyric language.
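A minimal sketch of those wrangling steps, assuming the raw data frame is called lyrics_df (the actual object name is not shown in this notebook):

```r
library(dplyr)
library(stringr)
library(cld2)

lyrics_clean <- lyrics_df %>%
  mutate(
    seq   = str_replace_all(as.character(seq), "\n", " "),            # drop newline breaks
    label = as.factor(ifelse(label >= 0.5, "positive", "negative")),  # binarize valence at 0.5
    lang  = detect_language(seq)                                      # cld2 language detection
  ) %>%
  filter(lang == "en") %>%   # keep English lyrics only
  select(-lang)
```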

Let's see how our data has changed.

Next, due to my machine's limitations, I only use 45k songs for the analysis. The songs are selected by random sampling.
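Something like this, continuing from the lyrics_clean sketch above (the seed is an assumption):

```r
set.seed(100)  # assumed seed; the original is not shown
lyrics_sample <- lyrics_clean %>%
  slice_sample(n = 45000)
```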

Modeling

Naive Bayes

Modeling with NB needs special treatment of the train data. Each column represents a word and each row represents one song. NB doesn't need the exact count of each word; it only needs to know whether a word is present in the song or not. Thus, we convert the value in each cell to either 1 or 0: 1 means this specific word is present in the song, 0 means it is not.

Use a Bernoulli converter to turn any value above 0 into 1 and keep 0 as 0.
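A minimal version of such a converter (the function name is my own; returning the flags as characters keeps e1071's naiveBayes from treating the columns as numeric):

```r
# any count above 0 becomes "1" (present); 0 stays "0" (absent).
# character output makes naiveBayes treat each word as categorical
bernoulli_conv <- function(x) {
  ifelse(x > 0, "1", "0")
}
```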

Apply the Bernoulli converter to the train and test data.
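Applied column-wise, assuming the document-term matrices are named train_x and test_x (hypothetical names):

```r
# MARGIN = 2 applies the converter to every column (word)
train_bn <- apply(train_x, MARGIN = 2, bernoulli_conv)
test_bn  <- apply(test_x, MARGIN = 2, bernoulli_conv)
```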

A 0 in a cell indicates the song doesn't have a particular word. It also means that the corresponding class-feature combination has a probability of 0 of occurring, which would ruin the NB algorithm, since it computes the conditional a posteriori probabilities of a categorical class variable given independent predictor variables using Bayes' rule. We can specify laplace = 1 to enable add-one smoothing.
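With e1071 the fit could then look like this; train_bn and the label vector train_y are the assumed names from above:

```r
library(e1071)

# Naive Bayes with add-one (Laplace) smoothing
model_nb <- naiveBayes(x = train_bn, y = train_y, laplace = 1)
```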

Create a confusion matrix for later evaluation.
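Presumably via caret, whose confusionMatrix printout matches the output below (pred_nb and test_y are assumed names):

```r
library(caret)

pred_nb <- predict(model_nb, test_bn)
confusionMatrix(data = pred_nb, reference = test_y, positive = "positive")
```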

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction negative positive
##   negative     4363     2733
##   positive     1464     2690
##                                                
##                Accuracy : 0.6269               
##                  95% CI : (0.6179, 0.6359)     
##     No Information Rate : 0.518                
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.2468               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
##                                                
##             Sensitivity : 0.4960               
##             Specificity : 0.7488               
##          Pos Pred Value : 0.6476               
##          Neg Pred Value : 0.6149               
##              Prevalence : 0.4820               
##          Detection Rate : 0.2391               
##    Detection Prevalence : 0.3692               
##       Balanced Accuracy : 0.6224               
##                                                
##        'Positive' Class : positive             
## 

Decision Tree

Next we will build models with different algorithms: a decision tree and MARS. Before that, we need to make a data frame from the cleaned data. The token values will not be converted to 1 or 0 as in Naive Bayes; they'll remain the original counts.

Split the data into train and test sets.
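With tidymodels, which the tuning sections below suggest this notebook uses, the split and a baseline decision tree fit might look like this; the object names and the 75/25 proportion are assumptions, and lyrics_tokens stands in for the token-count data frame just described:

```r
library(tidymodels)

set.seed(100)
lyric_split <- initial_split(lyrics_tokens, prop = 0.75, strata = label)
train_tree  <- training(lyric_split)
test_tree   <- testing(lyric_split)

# baseline decision tree with the rpart engine
model_dt <- decision_tree(mode = "classification") %>%
  set_engine("rpart") %>%
  fit(label ~ ., data = train_tree)
```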

Create a confusion matrix for later evaluation.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction negative positive
##   negative     3671     2404
##   positive     2188     2986
##                                               
##                Accuracy : 0.5918              
##                  95% CI : (0.5826, 0.6009)    
##     No Information Rate : 0.5208              
##     P-Value [Acc > NIR] : < 0.0000000000000002
##                                               
##                   Kappa : 0.1808              
##                                               
##  Mcnemar's Test P-Value : 0.00151             
##                                               
##             Sensitivity : 0.5540              
##             Specificity : 0.6266              
##          Pos Pred Value : 0.5771              
##          Neg Pred Value : 0.6043              
##              Prevalence : 0.4792              
##          Detection Rate : 0.2654              
##    Detection Prevalence : 0.4600              
##       Balanced Accuracy : 0.5903              
##                                               
##        'Positive' Class : positive            
## 

MARS

Next we build a third model using the MARS algorithm.
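parsnip provides a MARS specification backed by the earth package; a plausible fit, reusing the assumed train/test objects from above:

```r
# MARS via the earth engine
model_mars <- mars(mode = "classification") %>%
  set_engine("earth") %>%
  fit(label ~ ., data = train_tree)

pred_mars <- predict(model_mars, test_tree)$.pred_class
```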

Create a confusion matrix for later use.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction negative positive
##   negative     4403     2880
##   positive     1456     2510
##                                                
##                Accuracy : 0.6145               
##                  95% CI : (0.6055, 0.6236)     
##     No Information Rate : 0.5208               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.2195               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
##                                                
##             Sensitivity : 0.4657               
##             Specificity : 0.7515               
##          Pos Pred Value : 0.6329               
##          Neg Pred Value : 0.6046               
##              Prevalence : 0.4792               
##          Detection Rate : 0.2231               
##    Detection Prevalence : 0.3526               
##       Balanced Accuracy : 0.6086               
##                                                
##        'Positive' Class : positive             
## 

Random Forest
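A plain random forest in the same parsnip style; the ranger engine and object names are assumptions:

```r
# random forest via the ranger engine, default hyperparameters
model_rf <- rand_forest(mode = "classification") %>%
  set_engine("ranger") %>%
  fit(label ~ ., data = train_tree)

pred_rf <- predict(model_rf, test_tree)$.pred_class
```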

Create a confusion matrix for later use.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction negative positive
##   negative     4140     2019
##   positive     1719     3371
##                                                
##                Accuracy : 0.6677               
##                  95% CI : (0.6589, 0.6764)     
##     No Information Rate : 0.5208               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.3328               
##                                                
##  Mcnemar's Test P-Value : 0.000001006          
##                                                
##             Sensitivity : 0.6254               
##             Specificity : 0.7066               
##          Pos Pred Value : 0.6623               
##          Neg Pred Value : 0.6722               
##              Prevalence : 0.4792               
##          Detection Rate : 0.2997               
##    Detection Prevalence : 0.4525               
##       Balanced Accuracy : 0.6660               
##                                                
##        'Positive' Class : positive             
## 

Model Tuning

Sadly, I'm not satisfied with the results. The highest accuracy is only 66.77%. I will try to tune some models in the hope of getting a better result.

Decision Tree

In a decision tree we can tune parameters such as cost_complexity, tree_depth, and min_n. This time we will do a grid search over tree_depth and min_n, using 5 candidate values per parameter evaluated with 3-fold cross-validation.
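A sketch of that grid search with tune; the grid and fold object names are mine, and the setup follows the description above:

```r
# tunable spec: only tree_depth and min_n are searched here
dt_tune_spec <- decision_tree(
    mode       = "classification",
    tree_depth = tune(),
    min_n      = tune()
  ) %>%
  set_engine("rpart")

set.seed(100)
folds   <- vfold_cv(train_tree, v = 3)                       # 3-fold cross-validation
dt_grid <- grid_regular(tree_depth(), min_n(), levels = 5)   # 5 levels per parameter

dt_tuned <- tune_grid(dt_tune_spec, label ~ ., resamples = folds, grid = dt_grid)
select_best(dt_tuned, metric = "accuracy")
```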

The grid tuning really takes a lot of time. My PC's RAM couldn't even load this notebook until I cleared the R history and environment, so I can't show the output, but I assure you the best parameters are tree_depth = 23 and min_n = 20.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction negative positive
##   negative     3671     2404
##   positive     2188     2986
##                                               
##                Accuracy : 0.5918              
##                  95% CI : (0.5826, 0.6009)    
##     No Information Rate : 0.5208              
##     P-Value [Acc > NIR] : < 0.0000000000000002
##                                               
##                   Kappa : 0.1808              
##                                               
##  Mcnemar's Test P-Value : 0.00151             
##                                               
##             Sensitivity : 0.5540              
##             Specificity : 0.6266              
##          Pos Pred Value : 0.5771              
##          Neg Pred Value : 0.6043              
##              Prevalence : 0.4792              
##          Detection Rate : 0.2654              
##    Detection Prevalence : 0.4600              
##       Balanced Accuracy : 0.5903              
##                                               
##        'Positive' Class : positive            
## 

Random Forest

In Random Forest we can tune parameters such as trees and mtry. This time we will do a grid search over the number of trees and mtry, using 4 candidate values per parameter evaluated with 3-fold cross-validation.
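The analogous random forest search; mtry needs explicit bounds in dials, so the range here is an assumption:

```r
# tunable random forest: mtry and number of trees are searched
rf_tune_spec <- rand_forest(
    mode  = "classification",
    mtry  = tune(),
    trees = tune()
  ) %>%
  set_engine("ranger")

rf_grid <- grid_regular(
  mtry(range = c(2L, 10L)),   # assumed bounds
  trees(),
  levels = 4
)

rf_tuned <- tune_grid(rf_tune_spec, label ~ ., resamples = folds, grid = rf_grid)
select_best(rf_tuned, metric = "accuracy")
```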

The tuning takes a lot of time. The parameters for the best result are mtry = 6 and trees = 550. I'll show you the code, but for time efficiency I'll exclude it from execution and load the pre-built model instead.

Build a confusion matrix for evaluation.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction negative positive
##   negative     4117     1981
##   positive     1742     3409
##                                                
##                Accuracy : 0.669                
##                  95% CI : (0.6603, 0.6777)     
##     No Information Rate : 0.5208               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.3357               
##                                                
##  Mcnemar's Test P-Value : 0.00009596           
##                                                
##             Sensitivity : 0.6325               
##             Specificity : 0.7027               
##          Pos Pred Value : 0.6618               
##          Neg Pred Value : 0.6751               
##              Prevalence : 0.4792               
##          Detection Rate : 0.3030               
##    Detection Prevalence : 0.4579               
##       Balanced Accuracy : 0.6676               
##                                                
##        'Positive' Class : positive             
## 

We got very little improvement: accuracy rose from 66.77% to 66.90%.

Model Evaluation and Conclusion

Let's combine all the confusion matrices to make the evaluation easier.
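One way to tabulate them, assuming the caret confusionMatrix objects were stored under the names below (the names are mine):

```r
# pull the headline metrics out of each caret confusionMatrix object
cms <- list(
  naive_bayes   = cm_nb,
  decision_tree = cm_dt,
  mars          = cm_mars,
  random_forest = cm_rf_tuned
)

comparison <- purrr::map_dfr(cms, function(cm) {
  tibble::tibble(
    accuracy    = cm$overall["Accuracy"],
    sensitivity = cm$byClass["Sensitivity"],
    specificity = cm$byClass["Specificity"]
  )
}, .id = "model")

comparison
```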

Since there's no urgency in this case, we will choose accuracy as our high-priority metric. Users can easily skip or remove recommended songs they don't like, and that won't affect our operational costs. A positive song in a sad-song playlist won't harm anyone, but it's better if we try to avoid it.

As we can see from the table above, the tuned Random Forest model has the highest accuracy. It's always possible to get higher accuracy (or better values of other metrics) by trying another classification model; we'll do that in the future. So in conclusion, we'll use the Random Forest model to predict a song's mood based on its lyrics.

Predicting a New Lyric

We only cover approximately 45k songs. There are thousands, if not millions, of songs worldwide, and it would be a shame if we couldn't predict the mood of any given song's lyrics. So here we will build a function to fit a plain new lyric text into our model. The data will be cleaned up automatically before we predict its mood.

Here I will use a song from the One Piece OST, opening 3, titled 'Hikari e' (To the Light). The song is originally Japanese, but I translated it to match our built model.

Model type_1

Here's the function. It's just all the cleaning steps combined into one function, building a new data frame as the output. It also matches the words used as predictor variables to the required column names (word, in this case) in the train data.
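A sketch of what such a function could look like, assuming tidytext tokenization and the training predictor names passed in as train_cols (all names here are mine):

```r
library(tidytext)

prep_lyric <- function(lyric, train_cols) {
  # tokenize the new lyric and count each word
  counts <- tibble::tibble(text = lyric) %>%
    unnest_tokens(word, text) %>%
    count(word)

  # one-row data frame with every training predictor, 0 for absent words
  out <- as.data.frame(matrix(0, nrow = 1, ncol = length(train_cols)))
  names(out) <- train_cols
  matched <- intersect(counts$word, train_cols)
  out[1, matched] <- counts$n[match(matched, counts$word)]
  out
}
```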

Let's predict the mood using the MARS model.

Our MARS model predicts it as a negative-mood song.

Model type_2

The Random Forest algorithm can't cope with column names that collide with special words like for, breaks, and next, so we build a different function for it. The only difference is in the dictionary of names: it follows the modified column names in train_tune and test_tune.

The Random Forest model (the best model in this case) also predicts the lyric as a negative-mood song.

Naive Bayes

We have a different format for Naive Bayes, so we'll also build a function to clean up the text and match it to the Naive Bayes requirements. The only difference is that in the last step we apply the Bernoulli converter and return a transposed matrix.
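Reusing prep_lyric and bernoulli_conv from above, that last step could be as small as this (again, assumed names):

```r
prep_lyric_nb <- function(lyric, train_cols) {
  counts <- prep_lyric(lyric, train_cols)
  # presence/absence flags; ifelse keeps the 1 x n_words matrix shape
  bernoulli_conv(as.matrix(counts))
}
```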

Predict the lyric using the Naive Bayes model.

## [1] negative
## Levels: negative positive

The Naive Bayes model also predicts the lyric as a negative-mood song. If you listen to the real song, it's actually a spirited, energetic, positive-mood track, but I never knew what the lyrics actually said.

Thank you!