Music is not only able to affect your mood. listening to particularly happy or sad music can even change the way we perceive the world, according to researchers from the University of Groningen. In this modern world we have an ability to choose what music we want to listen easily. Some music player platform such as Spotify
are known to its music recommender system. where they recommend music based on their customer historical or genre preferences individually. It will be a new idea if music can be enjoyed by its lyric and will get recommendations based on the mood of the lyrics.
This project is based on this kaggle dataset. The dataset contains 250k lyric with Valence value gathered using Spotify API. Valence is A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). Our task in this article is to perform supervised NLP sentiment analysis to measure the positiveness of a song. This kind of analysis can be used for the Spotify company itself to improve its music recommender system based on lyric (words).
Limitation: Languange are wide and complex. NLP are also known to its high computational value. So in this analysis i will only use english song lyrics and sampled the data to only 45k songs.
You can load the package into your workspace using the library()
function
## Observations: 158,353
## Variables: 5
## $ X <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ artist <fct> Elijah Blake, Elijah Blake, Elijah Blake, Elijah Blake, Elij...
## $ seq <fct> "No, no\nI ain't ever trapped out the bando\nBut oh Lord, do...
## $ song <fct> Everyday, Live Till We Die, The Otherside, Pinot, Shadows & ...
## $ label <dbl> 0.6260, 0.6300, 0.2400, 0.5360, 0.3710, 0.3210, 0.6010, 0.33...
## [1] No, no\nI ain't ever trapped out the bando\nBut oh Lord, don't get me wrong\nI know a couple niggas that do\nI'm from a place where everybody knows your name\nThey say I gotta watch my attitude\nWhen they see money, man they all start actin' strange\nSo fuck with the ones that fuck with you\nThey can never say I'm brand new\n\nIt's everyday, everyday\nEveryday, everyday, everyday\nEveryday, everyday\nEveryday, everyday\nI've been talkin' my shit, nigga that's regular\nI've been lovin' 'em thick, life is spectacular\nI spend like I'ma die rich, nigga I'm flexin', yeah\nEveryday, that's everyday\nThat's everyday\nThat's everyday\nThat's everyday, everyday\n\nI see all of these wanna-be hot R&B singers\nI swear you all sound the same\nThey start from the bottom, so far from the motto\nYou niggas'll never be Drake\nShout out to OVO\nMost of them prolly don't know me though\nI stay in the cut, I don't fuck with no\nBody but I D, that's a pun on No I.D\nWhen nobody know my name\nRunnin' for my dream wasn't hard to do\nYou break bread, I swear they all pull out a plate\nEat with the ones who starved with you\nIf I'm winnin' then my crew can't lose\n\nIt's everyday, everyday\nEveryday, everyday, everyday\nEveryday, everyday\nEveryday, everyday\nI've been talkin' my shit, nigga that's regular\nI've been lovin' 'em thick, life is spectacular\nI spend like I'ma die rich, nigga I'm flexin', yeah\nEveryday, that's everyday\nThat's everyday\nThat's everyday\nThat's everyday, everyday\n\nI heard since you got money\nYou changed, you're actin' funny\nThat's why I gets on my lonely\nYou be lovin' when change is a hobby\nWho do you dress when you ain't got nobody?\n\nIt's everyday, everyday\nEveryday, everyday, everyday\nEveryday, everyday\nEveryday, everyday\nI've been talkin' my shit, nigga that's regular\nI've been lovin' 'em thick, life is spectacular\nI spend like I'ma die rich, nigga I'm flexin', yeah\nEveryday, that's everyday\nThat's everyday\nThat's everyday\nThat's everyday, everyday
## 135645 Levels: ''Do you want... to have... a tasty... mushroom?' ...
The lyrics are stored in seq coloumn. as you can see it will need a lot of treatment before modeling. The simplest thing we can do first is to remove “ n” as its new line break. The target column (label) still in numeric format. as i said before, higher valence (label) value means the songs are considered as positive mood and lower valence means negative mood. I’ll convert the valence value to binary value labeled positive
and negative
with 0.5 as center value. I also want to filter english only lyrics to perform the NLP easier. I’ll use function from cld2
package to detect the lyric languange.
dat$seq <- str_replace_all(as.character(dat$seq), "\n"," ")
# valence with > 0.5 will labelled as potiive, < 0.5 negative
dat$mood <- ifelse(dat$label > 0.5, "positive","negative")
dat$lang <- cld2::detect_language(dat$seq)
# filter the data to english lyric only
dat <- dat[dat$lang == "en",]
let’s see how our data has changed
Next, due to my machine limitation, i only use 45k songs for analysis. the songs are selected from random sampling
textclean
x stringr
x tm
in text cleaning, i’m more familiar with tm
and stringr
package but i also want to learn textclean
(from tidytext) magic. i’ll use both package to clean my text data before modeling
dat <- dat %>%
mutate(text_clean = seq %>% # select seq column
str_to_lower() %>% # convert all the string to low alphabet
replace_contraction() %>% # replace contraction to their multi-word forms
replace_internet_slang() %>% # replace internet slang to normal words
replace_word_elongation() %>% # reolace informal writing with known semantic replacements
replace_number(remove = T) %>% # remove number
replace_date(replacement = "") %>% # remove date
str_remove_all(pattern = "[[:punct:]]") %>% # remove punctuation
str_squish() %>% # reduces repeated whitespace inside a string.
str_trim() # removes whitespace from start and end of string
)
# convert text data to corpus using tm package then do tokenize
corp <- VCorpus(VectorSource(dat$text_clean))
corp_dtm <- corp %>%
# use pre-build english stopwords
tm_map(removeWords, stopwords("en")) %>%
tm_map(stemDocument) %>%
# convert corpus to document term matrix
DocumentTermMatrix()
Next i want to limit the number of word to be used in modeling. i’ll only choose words that appear in 850 songs
Modeling using NB need special treatment in the train data. the column represents words and each row represents one single song. NB doesnt need the exact number of each words, it only need to know if the words are present in the song or not. thus, we convert the value in each cell to contain either 1 or 0. 1 means this specific word is present in the song, 0 means not present.
# split the data. 75% for train data, and 25% for test data
set.seed(1502)
index <- sample(1:nrow(dat.dtm), 0.75*nrow(dat.dtm))
train_x <- dat.dtm[index,]
test_x <- dat.dtm[-index,]
# subset label/target variable
train_label <- dat[index,"mood"]
test_label <- dat[-index,"mood"]
Use bernoulli converter to convert any value above 0 to 1 and 0 to remain 0.
apply bernoulli converter to train and test data
0 in a cell indicates the song doesn’t have a particular word. it also means that corresponding class-feature combination has a 0 probability of occuring. it will ruin the NB algorithm which computes the conditional a-posterior probabilities of a categorical class variable given independent predictor variables using the Bayes rule. We could specify laplace=1 to enable an add-one smoothing.
# train the model
mod.nb <- naiveBayes(train_x, as.factor(train_label), laplace = 1)
# predict to test data
pred.nb <- predict(mod.nb, test_x,
type = "class")
pred.nb.x <- cbind(data.frame(pred.nb),test_label)%>%
setNames(c("pred","actual"))
create confusion matrix for later evaluation
cf.nb <- confusionMatrix(data = pred.nb.x$pred,
reference = pred.nb.x$actual,
positive = "positive")
cf.nb
## Confusion Matrix and Statistics
##
## Reference
## Prediction negative positive
## negative 4363 2733
## positive 1464 2690
##
## Accuracy : 0.6269
## 95% CI : (0.6179, 0.6359)
## No Information Rate : 0.518
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.2468
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.4960
## Specificity : 0.7488
## Pos Pred Value : 0.6476
## Neg Pred Value : 0.6149
## Prevalence : 0.4820
## Detection Rate : 0.2391
## Detection Prevalence : 0.3692
## Balanced Accuracy : 0.6224
##
## 'Positive' Class : positive
##
Next we will build another model using different algorithm. we will use Decision tree and MARS. but before that, we need to make data frame with cleaned data. The token value will not be converted to 1 or 0 like naive bayes. it’ll remain original.
dat.clean <- as.data.frame(as.matrix(dat.dtm), stringsAsFactors = F)
# we have 800+ variable in words form. i change the label name from `mood` to labelY to avoid overwriting column names
new.dat <- cbind(dat.clean, data.frame(labelY = dat$mood))
splitting
set.seed(1502)
splitter <- initial_split(new.dat, prop = 0.75, strata = "labelY")
train <- training(splitter)
test <- testing(splitter)
mod.dt <- decision_tree(mode = "classification") %>%
set_engine("rpart") %>% fit(labelY~., data = train)
pred.dt <- predict(mod.dt, test,
type = "class")
pred.dt.x <- as.data.frame(cbind(pred.dt, test$labelY)) %>%
setNames(c("pred","actual"))
pred.dt.x
create confusion matrix for later evaluation
cf.dt <- confusionMatrix(data = pred.dt.x$pred,
reference = pred.dt.x$actual,
positive = "positive")
cf.dt
## Confusion Matrix and Statistics
##
## Reference
## Prediction negative positive
## negative 3671 2404
## positive 2188 2986
##
## Accuracy : 0.5918
## 95% CI : (0.5826, 0.6009)
## No Information Rate : 0.5208
## P-Value [Acc > NIR] : < 0.0000000000000002
##
## Kappa : 0.1808
##
## Mcnemar's Test P-Value : 0.00151
##
## Sensitivity : 0.5540
## Specificity : 0.6266
## Pos Pred Value : 0.5771
## Neg Pred Value : 0.6043
## Prevalence : 0.4792
## Detection Rate : 0.2654
## Detection Prevalence : 0.4600
## Balanced Accuracy : 0.5903
##
## 'Positive' Class : positive
##
next we build 3rd model using MARS algorithm
mod.mars <- mars(mode = "classification") %>%
set_engine("earth") %>% fit(labelY~., data = train)
pred.mars <- predict(mod.mars, test,
type = "class")
pred.mars.x <- as.data.frame(cbind(pred.mars, test$labelY)) %>%
setNames(c("pred","actual"))
pred.mars.x
create confusion matrix for later use
cf.mars <- confusionMatrix(data = pred.mars.x$pred,
reference = pred.mars.x$actual,
positive = "positive")
cf.mars
## Confusion Matrix and Statistics
##
## Reference
## Prediction negative positive
## negative 4403 2880
## positive 1456 2510
##
## Accuracy : 0.6145
## 95% CI : (0.6055, 0.6236)
## No Information Rate : 0.5208
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.2195
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.4657
## Specificity : 0.7515
## Pos Pred Value : 0.6329
## Neg Pred Value : 0.6046
## Prevalence : 0.4792
## Detection Rate : 0.2231
## Detection Prevalence : 0.3526
## Balanced Accuracy : 0.6086
##
## 'Positive' Class : positive
##
# this chunks are made for random forest model and future model tuning
## the column names like break,for,next,if are considered as special character thus raises an error when building random forest and model tuning.
## i store the train and test data to new variabel so the old one remain reproducible
train_tune <- train
test_tune <- test
colnames(train_tune) <- make.names(colnames(train_tune))
colnames(test_tune) <- make.names(colnames(test_tune))
# build 5 folds cross validation for tuning evaluationn
set.seed(1502)
folds <- vfold_cv(train_tune, 3)
mod.rf <- rand_forest(trees = 500, mtry = 5, mode = "classification") %>%
set_engine("ranger") %>% fit(labelY~., data = train_tune)
pred.rf <- predict(mod.rf, test_tune,
type = "class")
pred.rf.x <- as.data.frame(cbind(pred.rf, test_tune$labelY)) %>%
setNames(c("pred","actual"))
pred.rf.x
create confusion matrix for later use
cf.rf <- confusionMatrix(data = pred.rf.x$pred,
reference = pred.rf.x$actual,
positive = "positive")
cf.rf
## Confusion Matrix and Statistics
##
## Reference
## Prediction negative positive
## negative 4140 2019
## positive 1719 3371
##
## Accuracy : 0.6677
## 95% CI : (0.6589, 0.6764)
## No Information Rate : 0.5208
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.3328
##
## Mcnemar's Test P-Value : 0.000001006
##
## Sensitivity : 0.6254
## Specificity : 0.7066
## Pos Pred Value : 0.6623
## Neg Pred Value : 0.6722
## Prevalence : 0.4792
## Detection Rate : 0.2997
## Detection Prevalence : 0.4525
## Balanced Accuracy : 0.6660
##
## 'Positive' Class : positive
##
Sadly i’m not satisfy about the result. the highest Accuracy only 66.75%. i will try to tune some models in hope we can get better result
in decision tree we can do some parameter tuning like cost_complexity
, tree_depth
, and min_n
. This time, we will do a grid tuning for tree_depth and min_n by given number. we’ll do the grid search 5 times with 3 k-fold cross validation
# create the grid. the tune will build models using combination of randomized cost_complexity, tree_depth, and min_n from grid_max_entropy function
dt.grid <- expand.grid(tree_depth = c(23,25,27,29,30), min_n = c(20,21,22,23,24))
dt.setup <- decision_tree(min_n = tune(), tree_depth = tune(), cost_complexity = 0.01) %>%
set_engine("rpart") %>%
set_mode("classification")
# dt.tune <- tune_grid(labelY~., model = dt.setup, grid = dt.grid, resamples = folds,
# metrics = metric_set(accuracy, spec, sens))
The grid tuning really takes a lot of time. My pc RAM can’t even load this notebook before i clean the rhistory and environment. So i can’t save the output but i ensure you the best parameters are tree_depth = 23 and min_n = 20.
# select the best parameters and apply to new model
# best.dt <- dt.tune %>% select_best("accuracy", maximize = F)
#
# # The best parameters for highest accuract are tree_depth = 23, min_n =20
# mod.dt.2 <- dt.setup %>% finalize_model(parameters =
# best.dt)
# build new model using the best tuned parameters
# mod.dt.2x <- mod.dt.2 %>% fit(labelY~., data = train)
mod.dt.2x <- decision_tree(mode = "classification",
tree_depth = 23, min_n = 20, cost_complexity = 0.01) %>%
set_engine("rpart") %>% fit(labelY~., data = train_tune)
# predict new model to unseen data
pred.dt.2 <- predict(mod.dt.2x, test_tune,
type = "class")
pred.dt.2.x <- as.data.frame(cbind(pred.dt.2, test_tune$labelY)) %>%
setNames(c("pred","actual"))
pred.dt.2.x
cf.dt.2 <- confusionMatrix(data = pred.dt.2.x$pred,
reference = pred.dt.2.x$actual,
positive = "positive")
cf.dt.2
## Confusion Matrix and Statistics
##
## Reference
## Prediction negative positive
## negative 3671 2404
## positive 2188 2986
##
## Accuracy : 0.5918
## 95% CI : (0.5826, 0.6009)
## No Information Rate : 0.5208
## P-Value [Acc > NIR] : < 0.0000000000000002
##
## Kappa : 0.1808
##
## Mcnemar's Test P-Value : 0.00151
##
## Sensitivity : 0.5540
## Specificity : 0.6266
## Pos Pred Value : 0.5771
## Neg Pred Value : 0.6043
## Prevalence : 0.4792
## Detection Rate : 0.2654
## Detection Prevalence : 0.4600
## Balanced Accuracy : 0.5903
##
## 'Positive' Class : positive
##
in Random Forest we can do some parameter tuning like trees
,and mtry
. This time, we will do a grid tuning for number of trees and mtry by given number. we’ll do the grid search 4 times with 3 k-fold cross validation.
rf.grid <- expand.grid(trees = c(450,500,550,600), mtry = 3:6)
rf.setup <- rand_forest(trees = tune(), mtry = tune()) %>%
set_engine("ranger") %>%
set_mode("classification")
# this tuning also takes a lot of time.
# rf.tune <- tune_grid(labelY~., model = rf.setup, resamples = folds,
# grid = rf.grid, metrics = metric_set(accuracy, sens, spec))
The tuning take a lots of time. the parameter for best results are mtry = 6 and trees = 550. here’s i show you the code but for time efficiency ill exclude it and load the pre-build model instead
# best.rfX <- rf.tune %>% select_best("accuracy", maximize = F)
# # the best parameter for highest reacall are mtry = 6 and trees = 550
# mod.rf.2X <- rf.setup %>% finalize_model(parameters =
# best.rfX)
# # rebuild the model
# mod.rf.2.new <- mod.rf.2X %>% fit(labelY~., data = train_tune)
# predict new model to unseen data
pred.rf.2 <- predict(mod.rf.2.x, test_tune,
type = "class")
pred.rf.2.x <- as.data.frame(cbind(pred.rf.2, test_tune$labelY)) %>%
setNames(c("pred","actual"))
pred.rf.2.x
build confusion matrix for evaluating
cf.rf.2 <- confusionMatrix(data = pred.rf.2.x$pred,
reference = pred.rf.2.x$actual,
positive = "positive")
cf.rf.2
## Confusion Matrix and Statistics
##
## Reference
## Prediction negative positive
## negative 4117 1981
## positive 1742 3409
##
## Accuracy : 0.669
## 95% CI : (0.6603, 0.6777)
## No Information Rate : 0.5208
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.3357
##
## Mcnemar's Test P-Value : 0.00009596
##
## Sensitivity : 0.6325
## Specificity : 0.7027
## Pos Pred Value : 0.6618
## Neg Pred Value : 0.6751
## Prevalence : 0.4792
## Detection Rate : 0.3030
## Detection Prevalence : 0.4579
## Balanced Accuracy : 0.6676
##
## 'Positive' Class : positive
##
We’ve got verry little improvement from accuracy 66.7 to 66.9
Let’s combine all the confusion matrix to make the evaluation easier
df.nb <- data.frame(t(as.matrix(cf.nb, what = "classes")))
df.nb <- cbind(df.nb, data.frame(t(as.matrix(cf.nb,what = "overall"))))
df.dt <- data.frame(t(as.matrix(cf.dt, what = "classes")))
df.dt <- cbind(df.dt, data.frame(t(as.matrix(cf.dt,what = "overall"))))
df.mars <- data.frame(t(as.matrix(cf.mars, what = "classes")))
df.mars <- cbind(df.mars, data.frame(t(as.matrix(cf.mars,what = "overall"))))
df.rf <- data.frame(t(as.matrix(cf.rf, what = "classes")))
df.rf <- cbind(df.rf, data.frame(t(as.matrix(cf.rf,what = "overall"))))
df.dt.2 <- data.frame(t(as.matrix(cf.dt.2, what = "classes")))
df.dt.2 <- cbind(df.dt.2, data.frame(t(as.matrix(cf.dt.2,what = "overall"))))
df.rf.2 <- data.frame(t(as.matrix(cf.rf.2, what = "classes")))
df.rf.2 <- cbind(df.rf.2, data.frame(t(as.matrix(cf.rf.2,what = "overall"))))
all.eval <- rbind(Naive_Bayes = df.nb,
Decision_Tree = df.dt,
Mars = df.mars,
Random_Forest = df.rf,
Decision_Tree_tuned = df.dt.2,
Random_Forest_tuned = df.rf.2) %>%
select("Accuracy","Sensitivity","Specificity","Precision","F1") %>% data.frame()
all.eval
Since there’s no urgenity in this case, we will choose Accuracy as our high-priority metric to solve the case. User can easily remove or skip if they dont like the recommended songs and it will not affect our operational cost. Positive song in sad song playlist will not harm anyone but its better if we try avoid it.
As we can see from the table above, Random Forest tuned
model has the highest Accuracy. It’ll always possible to have higher accuracy (or other metrics) if we try another classification model. We’ll do that in the future. So in conclusion, we’ll use Random Forest model to predict song’s mood based on its lyric.
we only cover approximately 45k songs. there’s thousand if not million songs worldwide and it’s such a shame if we can predict the mood given the song’s lyric. so here we will try to build a function to suit a plain new lyric text into our model. The data will be cleaned up automaitcaly before we predict their mood.
here i will use a song from One Piece OST opening 3 titled ‘hikari e’ (to the light). the song is originally japanese but i translate it to match our builded model.
# new text lyric
text <- "I've just now begun to search, over the splashing waves
For the everlasting world
With this overflowing passion in my chest, I will go anywhere
Seeking the light yet unseen.
When the summer sun shakes my heart's sail
That's the signal to open the door to a new world
Swaying on the waves, supassing my despair
Aiming for the other side of the horizon.
I've just now begun to search, over the splashing waves,
For the everlasting world
With this overflowing passion in my chest, I will go anywhere,
Seeking the light yet unseen.
A current of repetitious days and mundane clouds
I see reflected in you a future you can't possibly know
Even if I avoid pain by not changing
That leaves me without dreams or even hope -- so let's go!.
Why am I searching? What is it I want?
The answer is surely somewhere ahead
My heart will go on to the moving world
Hiding my yet unseen strength.
Why am I searching? What is it I want?
Where is the yet unseen treasure?
With this overflowing passion in my chest, how far can I go?
I don't know, but
I've just now begun to search, over the splashing waves,
For the everlasting world
With this overflowing passion in my chest, I will go anywhere,
Seeking the light yet unseen
To the other side"
here’s the function. its just all the cleaning step combined into one function and build new data frame as the output. it also matching words as predictor variable to required column names (word in this case) in train data.
textcleaner <- function(x){
x <- as.character(x)
x <- x %>%
str_to_lower() %>%
replace_contraction() %>%
replace_internet_slang() %>%
replace_word_elongation() %>%
replace_number(remove = T) %>%
replace_date(replacement = "") %>%
str_remove_all(pattern = "[[:punct:]]") %>%
str_squish() %>%
str_trim()
xdtm <- VCorpus(VectorSource(x)) %>%
tm_map(removeWords, stopwords("en")) %>%
tm_map(stemDocument) %>%
DocumentTermMatrix(control = list(
dictionary = names(train)
))
dfx <- as.data.frame(as.matrix(xdtm), stringAsFactors=F)
}
let’s predict the mood using mars model
Our Mars model predict it as negative mood music.
Random forest algorithm can’t copy special character column names like for,breaks,and next so we build a different function for it. the difference is only in dictionary names its follow the modified column names in train_tune and test_tune
textcleaner_2 <- function(x){
x <- as.character(x)
x <- x %>%
str_to_lower() %>%
replace_contraction() %>%
replace_internet_slang() %>%
replace_word_elongation() %>%
replace_number(remove = T) %>%
replace_date(replacement = "") %>%
str_remove_all(pattern = "[[:punct:]]") %>%
str_squish() %>%
str_trim()
xdtm <- VCorpus(VectorSource(x)) %>%
tm_map(removeWords, stopwords("en")) %>%
tm_map(stemDocument) %>%
DocumentTermMatrix(control = list(
dictionary = names(train_tune)
))
dfx <- as.data.frame(as.matrix(xdtm), stringAsFactors=F)
}
The random forest model (best model in this case) also predict the lyric as negative-mood song.
we’ve different format for Naive Bayes. we’ll also build function to clean up the text and matching it to Naive Bayes requirement. the only different is in the last step we apply benoulli converter and return it as transposed matrix.
textcleaner_nb <- function(x){
x <- as.character(x)
x <- x %>%
str_to_lower() %>%
replace_contraction() %>%
replace_internet_slang() %>%
replace_word_elongation() %>%
replace_number(remove = T) %>%
replace_date(replacement = "") %>%
str_remove_all(pattern = "[[:punct:]]") %>%
str_squish() %>%
str_trim()
xdtm <- VCorpus(VectorSource(x)) %>%
tm_map(removeWords, stopwords("en")) %>%
tm_map(stemDocument) %>%
DocumentTermMatrix(control = list(
dictionary = names(train)
))
dfx <- as.data.frame(as.matrix(xdtm), stringAsFactors=F)
dfx <- apply(dfx,2,bernoulli_conv)
return(t(as.matrix(dfx)))
}
Predict the lyrics using naive bayes model
## [1] negative
## Levels: negative positive
Naive Bayes Model are also predict the lyric as negative mood music. if you hear the real song, its actually a spirit, energic, and positive mood music but i never know what’s the lyrics actually say.
thank you!