Music is not only able to affect your mood. listening to particularly happy or sad music can even change the way we perceive the world, according to researchers from the University of Groningen. In this modern world we have an ability to choose what music we want to listen easily. Some music player platform such as Spotify are known to its music recommender system. where they recommend music based on their customer historical or genre preferences individually. It will be a new idea if music can be enjoyed by its lyric and will get recommendations based on the mood of the lyrics.

Background

Objective

This project is based on this kaggle dataset. The dataset contains 250k lyric with Valence value gathered using Spotify API. Valence is A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). Our task in this article is to perform supervised NLP sentiment analysis to measure the positiveness of a song. This kind of analysis can be used for the Spotify company itself to improve its music recommender system based on lyric (words).

Limitation: Languange are wide and complex. NLP are also known to its high computational value. So in this analysis i will only use english song lyrics and sampled the data to only 45k songs.

Libraries

You can load the package into your workspace using the library() function

library(dplyr)
library(tidytext)
library(textclean)
library(tm)
library(SnowballC)
library(stringr)
library(rsample)
library(cld2)
library(caret)
library(e1071)
library(tidymodels)

Let’s begin

Data Import

dat <- read.csv("labeled_lyrics_cleaned.csv")
glimpse(dat)

## Observations: 158,353
## Variables: 5
## $ X      <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ artist <fct> Elijah Blake, Elijah Blake, Elijah Blake, Elijah Blake, Elij...
## $ seq    <fct> "No, no\nI ain't ever trapped out the bando\nBut oh Lord, do...
## $ song   <fct> Everyday, Live Till We Die, The Otherside, Pinot, Shadows & ...
## $ label  <dbl> 0.6260, 0.6300, 0.2400, 0.5360, 0.3710, 0.3210, 0.6010, 0.33...

Data Wrangling

Feature Engineering

head(dat$seq,1)

## [1] No, no\nI ain't ever trapped out the bando\nBut oh Lord, don't get me wrong\nI know a couple niggas that do\nI'm from a place where everybody knows your name\nThey say I gotta watch my attitude\nWhen they see money, man they all start actin' strange\nSo fuck with the ones that fuck with you\nThey can never say I'm brand new\n\nIt's everyday, everyday\nEveryday, everyday, everyday\nEveryday, everyday\nEveryday, everyday\nI've been talkin' my shit, nigga that's regular\nI've been lovin' 'em thick, life is spectacular\nI spend like I'ma die rich, nigga I'm flexin', yeah\nEveryday, that's everyday\nThat's everyday\nThat's everyday\nThat's everyday, everyday\n\nI see all of these wanna-be hot R&B singers\nI swear you all sound the same\nThey start from the bottom, so far from the motto\nYou niggas'll never be Drake\nShout out to OVO\nMost of them prolly don't know me though\nI stay in the cut, I don't fuck with no\nBody but I D, that's a pun on No I.D\nWhen nobody know my name\nRunnin' for my dream wasn't hard to do\nYou break bread, I swear they all pull out a plate\nEat with the ones who starved with you\nIf I'm winnin' then my crew can't lose\n\nIt's everyday, everyday\nEveryday, everyday, everyday\nEveryday, everyday\nEveryday, everyday\nI've been talkin' my shit, nigga that's regular\nI've been lovin' 'em thick, life is spectacular\nI spend like I'ma die rich, nigga I'm flexin', yeah\nEveryday, that's everyday\nThat's everyday\nThat's everyday\nThat's everyday, everyday\n\nI heard since you got money\nYou changed, you're actin' funny\nThat's why I gets on my lonely\nYou be lovin' when change is a hobby\nWho do you dress when you ain't got nobody?\n\nIt's everyday, everyday\nEveryday, everyday, everyday\nEveryday, everyday\nEveryday, everyday\nI've been talkin' my shit, nigga that's regular\nI've been lovin' 'em thick, life is spectacular\nI spend like I'ma die rich, nigga I'm flexin', yeah\nEveryday, that's everyday\nThat's everyday\nThat's everyday\nThat's everyday, everyday
## 135645 Levels: ''Do you want... to have... a tasty... mushroom?' ...

The lyrics are stored in seq coloumn. as you can see it will need a lot of treatment before modeling. The simplest thing we can do first is to remove “ n” as its new line break. The target column (label) still in numeric format. as i said before, higher valence (label) value means the songs are considered as positive mood and lower valence means negative mood. I’ll convert the valence value to binary value labeled positive and negative with 0.5 as center value. I also want to filter english only lyrics to perform the NLP easier. I’ll use function from cld2 package to detect the lyric languange.

dat$seq <- str_replace_all(as.character(dat$seq), "\n"," ")
# valence with > 0.5 will labelled as potiive, < 0.5 negative
dat$mood <- ifelse(dat$label > 0.5, "positive","negative")
dat$lang <- cld2::detect_language(dat$seq)
# filter the data to english lyric only
dat <- dat[dat$lang == "en",]

let’s see how our data has changed

head(dat)

Next, due to my machine limitation, i only use 45k songs for analysis. the songs are selected from random sampling

set.seed(1502)
# only select data with complete information
dat <- dat[complete.cases(dat),]
dat <- dat[sample(nrow(dat),45000),]

Text cleaning

`textclean` x `stringr` x `tm`

in text cleaning, i’m more familiar with tm and stringr package but i also want to learn textclean (from tidytext) magic. i’ll use both package to clean my text data before modeling

dat <- dat %>%
  mutate(text_clean = seq %>%  # select seq column
           str_to_lower() %>%  # convert all the string to low alphabet
           replace_contraction() %>% # replace contraction to their multi-word forms
           replace_internet_slang() %>% # replace internet slang to normal words
           replace_word_elongation() %>% # reolace informal writing with known semantic replacements
           replace_number(remove = T) %>% # remove number
           replace_date(replacement = "") %>% # remove date
           str_remove_all(pattern = "[[:punct:]]") %>% # remove punctuation
           str_squish() %>% # reduces repeated whitespace inside a string.
           str_trim() # removes whitespace from start and end of string
         )

# convert text data to corpus using tm package then do tokenize
corp <- VCorpus(VectorSource(dat$text_clean))

corp_dtm <- corp %>% 
  # use pre-build english stopwords
  tm_map(removeWords, stopwords("en")) %>%
  tm_map(stemDocument) %>%
  # convert corpus to document term matrix
  DocumentTermMatrix()

Next i want to limit the number of word to be used in modeling. i’ll only choose words that appear in 850 songs

freq_term <- findFreqTerms(corp_dtm, 850)
# 815 words are selected
dat.dtm <- corp_dtm[,freq_term]

Modeling

Naive Bayes

Modeling using NB need special treatment in the train data. the column represents words and each row represents one single song. NB doesnt need the exact number of each words, it only need to know if the words are present in the song or not. thus, we convert the value in each cell to contain either 1 or 0. 1 means this specific word is present in the song, 0 means not present.

# split the data. 75% for train data, and 25% for test data
set.seed(1502)
index <- sample(1:nrow(dat.dtm), 0.75*nrow(dat.dtm))

train_x <- dat.dtm[index,]
test_x <- dat.dtm[-index,]
# subset label/target variable
train_label <- dat[index,"mood"]
test_label <- dat[-index,"mood"]

Use bernoulli converter to convert any value above 0 to 1 and 0 to remain 0.

# bernoulli conv 
bernoulli_conv <- function(x){
  x <- as.factor(as.numeric(x>0))
}

apply bernoulli converter to train and test data

train_x <- apply(train_x,2,bernoulli_conv)
test_x <- apply(test_x,2,bernoulli_conv)

0 in a cell indicates the song doesn’t have a particular word. it also means that corresponding class-feature combination has a 0 probability of occuring. it will ruin the NB algorithm which computes the conditional a-posterior probabilities of a categorical class variable given independent predictor variables using the Bayes rule. We could specify laplace=1 to enable an add-one smoothing.

# train the model
mod.nb <- naiveBayes(train_x, as.factor(train_label), laplace = 1)
# predict to test data
pred.nb <- predict(mod.nb, test_x,
                   type = "class")

pred.nb.x <- cbind(data.frame(pred.nb),test_label)%>%
  setNames(c("pred","actual"))

create confusion matrix for later evaluation

cf.nb <- confusionMatrix(data = pred.nb.x$pred,
                         reference = pred.nb.x$actual,
                         positive = "positive")
cf.nb

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction negative positive
##   negative     4363     2733
##   positive     1464     2690
##                                                
##                Accuracy : 0.6269               
##                  95% CI : (0.6179, 0.6359)     
##     No Information Rate : 0.518                
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.2468               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
##                                                
##             Sensitivity : 0.4960               
##             Specificity : 0.7488               
##          Pos Pred Value : 0.6476               
##          Neg Pred Value : 0.6149               
##              Prevalence : 0.4820               
##          Detection Rate : 0.2391               
##    Detection Prevalence : 0.3692               
##       Balanced Accuracy : 0.6224               
##                                                
##        'Positive' Class : positive             
##

Decision Tree

Next we will build another model using different algorithm. we will use Decision tree and MARS. but before that, we need to make data frame with cleaned data. The token value will not be converted to 1 or 0 like naive bayes. it’ll remain original.

dat.clean <- as.data.frame(as.matrix(dat.dtm), stringsAsFactors = F)
# we have 800+ variable in words form. i change the label name from `mood` to labelY to avoid overwriting column names
new.dat <- cbind(dat.clean, data.frame(labelY = dat$mood))

splitting

set.seed(1502)
splitter <- initial_split(new.dat, prop = 0.75, strata = "labelY")
train <- training(splitter)
test <- testing(splitter)

mod.dt <- decision_tree(mode = "classification") %>%
  set_engine("rpart") %>% fit(labelY~., data = train)

pred.dt <- predict(mod.dt, test, 
                   type = "class")

pred.dt.x <- as.data.frame(cbind(pred.dt, test$labelY)) %>%
  setNames(c("pred","actual"))

pred.dt.x

create confusion matrix for later evaluation

cf.dt <- confusionMatrix(data = pred.dt.x$pred,
                         reference = pred.dt.x$actual,
                         positive = "positive")
cf.dt

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction negative positive
##   negative     3671     2404
##   positive     2188     2986
##                                               
##                Accuracy : 0.5918              
##                  95% CI : (0.5826, 0.6009)    
##     No Information Rate : 0.5208              
##     P-Value [Acc > NIR] : < 0.0000000000000002
##                                               
##                   Kappa : 0.1808              
##                                               
##  Mcnemar's Test P-Value : 0.00151             
##                                               
##             Sensitivity : 0.5540              
##             Specificity : 0.6266              
##          Pos Pred Value : 0.5771              
##          Neg Pred Value : 0.6043              
##              Prevalence : 0.4792              
##          Detection Rate : 0.2654              
##    Detection Prevalence : 0.4600              
##       Balanced Accuracy : 0.5903              
##                                               
##        'Positive' Class : positive            
##

Mars

next we build 3rd model using MARS algorithm

mod.mars <- mars(mode = "classification") %>%
  set_engine("earth") %>% fit(labelY~., data = train)

pred.mars <- predict(mod.mars, test, 
                   type = "class")

pred.mars.x <- as.data.frame(cbind(pred.mars, test$labelY)) %>%
  setNames(c("pred","actual"))

pred.mars.x

create confusion matrix for later use

cf.mars <- confusionMatrix(data = pred.mars.x$pred,
                         reference = pred.mars.x$actual,
                         positive = "positive")
cf.mars

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction negative positive
##   negative     4403     2880
##   positive     1456     2510
##                                                
##                Accuracy : 0.6145               
##                  95% CI : (0.6055, 0.6236)     
##     No Information Rate : 0.5208               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.2195               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
##                                                
##             Sensitivity : 0.4657               
##             Specificity : 0.7515               
##          Pos Pred Value : 0.6329               
##          Neg Pred Value : 0.6046               
##              Prevalence : 0.4792               
##          Detection Rate : 0.2231               
##    Detection Prevalence : 0.3526               
##       Balanced Accuracy : 0.6086               
##                                                
##        'Positive' Class : positive             
##

# this chunks are made for random forest model and future model tuning
## the column names like break,for,next,if are considered as special character thus raises an error when building random forest and model tuning.
## i store the train and test data to new variabel so the old one remain reproducible
train_tune <- train
test_tune <- test

colnames(train_tune) <- make.names(colnames(train_tune))
colnames(test_tune) <- make.names(colnames(test_tune))

# build 5 folds cross validation for tuning evaluationn
set.seed(1502)
folds <- vfold_cv(train_tune, 3)

Random Forest

mod.rf <- rand_forest(trees = 500, mtry = 5, mode = "classification") %>%
  set_engine("ranger") %>% fit(labelY~., data = train_tune)

pred.rf <- predict(mod.rf, test_tune, 
                   type = "class")

pred.rf.x <- as.data.frame(cbind(pred.rf, test_tune$labelY)) %>%
  setNames(c("pred","actual"))

pred.rf.x

create confusion matrix for later use

cf.rf <- confusionMatrix(data = pred.rf.x$pred,
                         reference = pred.rf.x$actual,
                         positive = "positive")
cf.rf

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction negative positive
##   negative     4140     2019
##   positive     1719     3371
##                                                
##                Accuracy : 0.6677               
##                  95% CI : (0.6589, 0.6764)     
##     No Information Rate : 0.5208               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.3328               
##                                                
##  Mcnemar's Test P-Value : 0.000001006          
##                                                
##             Sensitivity : 0.6254               
##             Specificity : 0.7066               
##          Pos Pred Value : 0.6623               
##          Neg Pred Value : 0.6722               
##              Prevalence : 0.4792               
##          Detection Rate : 0.2997               
##    Detection Prevalence : 0.4525               
##       Balanced Accuracy : 0.6660               
##                                                
##        'Positive' Class : positive             
##

Model Tuning

Sadly i’m not satisfy about the result. the highest Accuracy only 66.75%. i will try to tune some models in hope we can get better result

Decision Tree

in decision tree we can do some parameter tuning like cost_complexity, tree_depth, and min_n. This time, we will do a grid tuning for tree_depth and min_n by given number. we’ll do the grid search 5 times with 3 k-fold cross validation

# create the grid. the tune will build models using combination of randomized cost_complexity, tree_depth, and min_n from grid_max_entropy function
dt.grid <- expand.grid(tree_depth = c(23,25,27,29,30), min_n = c(20,21,22,23,24))

dt.setup <- decision_tree(min_n = tune(), tree_depth = tune(), cost_complexity = 0.01) %>%
  set_engine("rpart") %>%
  set_mode("classification")

# dt.tune <- tune_grid(labelY~., model = dt.setup, grid = dt.grid, resamples = folds,
#                      metrics = metric_set(accuracy, spec, sens))

The grid tuning really takes a lot of time. My pc RAM can’t even load this notebook before i clean the rhistory and environment. So i can’t save the output but i ensure you the best parameters are tree_depth = 23 and min_n = 20.

#show_best(dt.tune, maximize = F, metric = "accuracy")

# select the best parameters and apply to new model
# best.dt <- dt.tune %>% select_best("accuracy", maximize = F)
# 
# # The best parameters for highest accuract are tree_depth = 23, min_n =20
# mod.dt.2 <- dt.setup %>% finalize_model(parameters = 
#                                           best.dt)

# build new model using the best tuned parameters
# mod.dt.2x <- mod.dt.2 %>% fit(labelY~., data = train)

mod.dt.2x <- decision_tree(mode = "classification",
                           tree_depth = 23, min_n = 20, cost_complexity = 0.01) %>%
  set_engine("rpart") %>% fit(labelY~., data = train_tune)

# predict new model to unseen data
pred.dt.2 <- predict(mod.dt.2x, test_tune, 
                   type = "class")

pred.dt.2.x <- as.data.frame(cbind(pred.dt.2, test_tune$labelY)) %>%
  setNames(c("pred","actual"))

pred.dt.2.x

cf.dt.2 <- confusionMatrix(data = pred.dt.2.x$pred,
                         reference = pred.dt.2.x$actual,
                         positive = "positive")
cf.dt.2

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction negative positive
##   negative     3671     2404
##   positive     2188     2986
##                                               
##                Accuracy : 0.5918              
##                  95% CI : (0.5826, 0.6009)    
##     No Information Rate : 0.5208              
##     P-Value [Acc > NIR] : < 0.0000000000000002
##                                               
##                   Kappa : 0.1808              
##                                               
##  Mcnemar's Test P-Value : 0.00151             
##                                               
##             Sensitivity : 0.5540              
##             Specificity : 0.6266              
##          Pos Pred Value : 0.5771              
##          Neg Pred Value : 0.6043              
##              Prevalence : 0.4792              
##          Detection Rate : 0.2654              
##    Detection Prevalence : 0.4600              
##       Balanced Accuracy : 0.5903              
##                                               
##        'Positive' Class : positive            
##

Random Forest

in Random Forest we can do some parameter tuning like trees,and mtry. This time, we will do a grid tuning for number of trees and mtry by given number. we’ll do the grid search 4 times with 3 k-fold cross validation.

rf.grid <- expand.grid(trees = c(450,500,550,600), mtry = 3:6)

rf.setup <- rand_forest(trees = tune(), mtry = tune()) %>%
  set_engine("ranger") %>%
  set_mode("classification")

# this tuning also takes a lot of time. 
# rf.tune <- tune_grid(labelY~., model = rf.setup, resamples = folds,
#                      grid = rf.grid, metrics = metric_set(accuracy, sens, spec))

The tuning take a lots of time. the parameter for best results are mtry = 6 and trees = 550. here’s i show you the code but for time efficiency ill exclude it and load the pre-build model instead

# best.rfX <- rf.tune %>% select_best("accuracy", maximize = F)
# # the best parameter for highest reacall are mtry = 6 and trees = 550
# mod.rf.2X <- rf.setup %>% finalize_model(parameters = 
#                                           best.rfX)
# # rebuild the model
# mod.rf.2.new <- mod.rf.2X %>% fit(labelY~., data = train_tune)

# load pre-build rf tuned model
mod.rf.2.x <- readRDS("rf_mod_tuned.rds")

# predict new model to unseen data
pred.rf.2 <- predict(mod.rf.2.x, test_tune, 
                   type = "class")

pred.rf.2.x <- as.data.frame(cbind(pred.rf.2, test_tune$labelY)) %>%
  setNames(c("pred","actual"))

pred.rf.2.x

build confusion matrix for evaluating

cf.rf.2 <- confusionMatrix(data = pred.rf.2.x$pred,
                         reference = pred.rf.2.x$actual,
                         positive = "positive")
cf.rf.2

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction negative positive
##   negative     4117     1981
##   positive     1742     3409
##                                                
##                Accuracy : 0.669                
##                  95% CI : (0.6603, 0.6777)     
##     No Information Rate : 0.5208               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.3357               
##                                                
##  Mcnemar's Test P-Value : 0.00009596           
##                                                
##             Sensitivity : 0.6325               
##             Specificity : 0.7027               
##          Pos Pred Value : 0.6618               
##          Neg Pred Value : 0.6751               
##              Prevalence : 0.4792               
##          Detection Rate : 0.3030               
##    Detection Prevalence : 0.4579               
##       Balanced Accuracy : 0.6676               
##                                                
##        'Positive' Class : positive             
##

We’ve got verry little improvement from accuracy 66.7 to 66.9

Model Evaluation and conclusion

Let’s combine all the confusion matrix to make the evaluation easier

df.nb <- data.frame(t(as.matrix(cf.nb, what = "classes")))
df.nb <- cbind(df.nb, data.frame(t(as.matrix(cf.nb,what = "overall"))))

df.dt <- data.frame(t(as.matrix(cf.dt, what = "classes")))
df.dt <- cbind(df.dt, data.frame(t(as.matrix(cf.dt,what = "overall"))))

df.mars <- data.frame(t(as.matrix(cf.mars, what = "classes")))
df.mars <- cbind(df.mars, data.frame(t(as.matrix(cf.mars,what = "overall"))))

df.rf <- data.frame(t(as.matrix(cf.rf, what = "classes")))
df.rf <- cbind(df.rf, data.frame(t(as.matrix(cf.rf,what = "overall"))))

df.dt.2 <- data.frame(t(as.matrix(cf.dt.2, what = "classes")))
df.dt.2 <- cbind(df.dt.2, data.frame(t(as.matrix(cf.dt.2,what = "overall"))))

df.rf.2 <- data.frame(t(as.matrix(cf.rf.2, what = "classes")))
df.rf.2 <- cbind(df.rf.2, data.frame(t(as.matrix(cf.rf.2,what = "overall"))))

all.eval <- rbind(Naive_Bayes = df.nb, 
                  Decision_Tree = df.dt,
                  Mars = df.mars,
                  Random_Forest = df.rf,
                  Decision_Tree_tuned = df.dt.2,
                  Random_Forest_tuned = df.rf.2) %>%
  select("Accuracy","Sensitivity","Specificity","Precision","F1") %>% data.frame()

all.eval

Since there’s no urgenity in this case, we will choose Accuracy as our high-priority metric to solve the case. User can easily remove or skip if they dont like the recommended songs and it will not affect our operational cost. Positive song in sad song playlist will not harm anyone but its better if we try avoid it.

As we can see from the table above, Random Forest tuned model has the highest Accuracy. It’ll always possible to have higher accuracy (or other metrics) if we try another classification model. We’ll do that in the future. So in conclusion, we’ll use Random Forest model to predict song’s mood based on its lyric.

Predicting new given lyric

we only cover approximately 45k songs. there’s thousand if not million songs worldwide and it’s such a shame if we can predict the mood given the song’s lyric. so here we will try to build a function to suit a plain new lyric text into our model. The data will be cleaned up automaitcaly before we predict their mood.

here i will use a song from One Piece OST opening 3 titled ‘hikari e’ (to the light). the song is originally japanese but i translate it to match our builded model.

# new text lyric
text <- "I've just now begun to search, over the splashing waves
For the everlasting world
With this overflowing passion in my chest, I will go anywhere
Seeking the light yet unseen. 

When  the summer sun shakes my heart's sail
That's the signal to open the door to a new world
Swaying on the waves, supassing my despair
Aiming for the other side of the horizon. 

I've just now begun to search, over the splashing waves,
For the everlasting world
With this overflowing passion in my chest, I will go anywhere,
Seeking the light yet unseen. 

A current of repetitious days and mundane clouds
I see reflected in you a future you can't possibly know
Even if I avoid pain by not changing
That leaves me without dreams or even hope -- so let's go!. 

Why am I searching?  What is it I want?
The answer is surely somewhere ahead
My heart will go on to the moving world
Hiding my yet unseen strength. 

Why am I searching?  What is it I want?
Where is the yet unseen treasure?
With this overflowing passion in my chest, how far can I go?
I don't know, but 

I've just now begun to search, over the splashing waves,
For the everlasting world
With this overflowing passion in my chest, I will go anywhere,
Seeking the light yet unseen

To the other side"

Model type_1

here’s the function. its just all the cleaning step combined into one function and build new data frame as the output. it also matching words as predictor variable to required column names (word in this case) in train data.

textcleaner <- function(x){
  x <- as.character(x)
  
  x <- x %>%
    str_to_lower() %>%
    replace_contraction() %>%
    replace_internet_slang() %>%
    replace_word_elongation() %>%
    replace_number(remove = T) %>%
    replace_date(replacement = "") %>%
    str_remove_all(pattern = "[[:punct:]]") %>%
    str_squish() %>%
    str_trim()
  
  xdtm <- VCorpus(VectorSource(x)) %>%
    tm_map(removeWords, stopwords("en")) %>%
    tm_map(stemDocument) %>% 
    DocumentTermMatrix(control = list(
      dictionary = names(train)
    ))
  
  dfx <- as.data.frame(as.matrix(xdtm), stringAsFactors=F)
    
}

let’s predict the mood using mars model

samptext <- textcleaner(text)

predict(mod.mars,samptext)

Our Mars model predict it as negative mood music.

Model type_2

Random forest algorithm can’t copy special character column names like for,breaks,and next so we build a different function for it. the difference is only in dictionary names its follow the modified column names in train_tune and test_tune

textcleaner_2 <- function(x){
  x <- as.character(x)
  
  x <- x %>%
    str_to_lower() %>%
    replace_contraction() %>%
    replace_internet_slang() %>%
    replace_word_elongation() %>%
    replace_number(remove = T) %>%
    replace_date(replacement = "") %>%
    str_remove_all(pattern = "[[:punct:]]") %>%
    str_squish() %>%
    str_trim()
  
  xdtm <- VCorpus(VectorSource(x)) %>%
    tm_map(removeWords, stopwords("en")) %>%
    tm_map(stemDocument) %>% 
    DocumentTermMatrix(control = list(
      dictionary = names(train_tune)
    ))
  
  dfx <- as.data.frame(as.matrix(xdtm), stringAsFactors=F)
    
}

samptext_2 <- textcleaner_2(text)

predict(mod.rf.2.x,samptext_2)

The random forest model (best model in this case) also predict the lyric as negative-mood song.

Naive Bayes

we’ve different format for Naive Bayes. we’ll also build function to clean up the text and matching it to Naive Bayes requirement. the only different is in the last step we apply benoulli converter and return it as transposed matrix.

textcleaner_nb <- function(x){
  x <- as.character(x)
  
  x <- x %>%
    str_to_lower() %>%
    replace_contraction() %>%
    replace_internet_slang() %>%
    replace_word_elongation() %>%
    replace_number(remove = T) %>%
    replace_date(replacement = "") %>%
    str_remove_all(pattern = "[[:punct:]]") %>%
    str_squish() %>%
    str_trim()
  
  xdtm <- VCorpus(VectorSource(x)) %>%
    tm_map(removeWords, stopwords("en")) %>%
    tm_map(stemDocument) %>% 
    DocumentTermMatrix(control = list(
      dictionary = names(train)
    ))
  
  dfx <- as.data.frame(as.matrix(xdtm), stringAsFactors=F)
  
  dfx <- apply(dfx,2,bernoulli_conv)

  return(t(as.matrix(dfx)))    
}

Predict the lyrics using naive bayes model

samptext_nb <- textcleaner_nb(text)

predict(mod.nb,samptext_nb, type = "class")

## [1] negative
## Levels: negative positive

Naive Bayes Model are also predict the lyric as negative mood music. if you hear the real song, its actually a spirit, energic, and positive mood music but i never know what’s the lyrics actually say.

thank you!

Lyric mood identifier

jojoecp

5/4/2020

Background

Objective

Libraries

Let’s begin

Data Import

Data Wrangling

Feature Engineering

Text cleaning

`textclean` x `stringr` x `tm`

Modeling

Naive Bayes

Decision Tree

Mars

Random Forest

Model Tuning

Decision Tree

Random Forest

Model Evaluation and conclusion

Predicting new given lyric

Model type_1

Model type_2

Naive Bayes

Lyric mood identifier

jojoecp

5/4/2020

Background

Objective

Libraries

Let’s begin

Data Import

Data Wrangling

Feature Engineering

Text cleaning

textclean x stringr x tm

Modeling

Naive Bayes

Decision Tree

Mars

Random Forest

Model Tuning

Decision Tree

Random Forest

Model Evaluation and conclusion

Predicting new given lyric

Model type_1

Model type_2

Naive Bayes

`textclean` x `stringr` x `tm`