knitr::include_graphics('Francesco.JPG')
…
This project seeks to show that song lyrics can be a determining factor of a song’s genre. I do not discount the power of harmony, melody, and rhythm to differentiate genre, but it is uncommon for musicians, educators, and others to classify music based only on a song’s lyrics. I accept this challenge.
Let’s look at potential uses. Businesses such as Pandora and Spotify have been analyzing and classifying music (including songs with lyrics) for their customers, to provide an inclusive listening experience. But many other businesses can benefit from music and lyrics analysis.
Let’s look at songwriters, their management, licensing companies such as ASCAP and BMI, and music libraries such as HD Publishing.
Among songwriters, there is a sub-group who are only lyricists, meaning someone else composes the music. Composers look for lyrics to compose to. Lyricists look for composers who are known for specific genres. Many haven’t worked together before. Managers often have knowledge of both sides, and will broker deals that combine their talents. Giving songwriters and composers the ability to look for each other by matching lyrics with music could further democratize the music industry. It could remove the need for a manager (and the associated fees) to connect songwriters with music composers and bands.
Licensing companies deal with both lyrics and songs as legal assets they manage for artists. While this is not the main part of their business, they would benefit from validating a lyric’s genre against what the artist thinks it is.
When another artist wants to record someone else’s song, they have to pay license fees to a licensing company such as ASCAP. ASCAP has a website that deals with this from a financial aspect, but not from a search and song selection aspect. Having this capability would enhance their service offering to clients, serve as a competitive differentiator, and potentially increase profits through “premium” chargeable services.
The goal of this project is to use data science to analyze a library of popular song lyrics, put the data in a clean and usable form, and build and train a model that classifies a song’s genre from the words that best characterize it. Ultimately, I plan to test the model on song lyrics that are not in the sample DB, using lyrics from major artists and others.
I will use the readily available subset of lyrics from The Million Song library. My sample contains 57,000 songs from multiple genres and artists, which I will split into training and test data to build and test my model. I will use packages designed for Natural Language Processing (NLP), including tidytext and syuzhet (for sentiment), among others.
To train the model, I need to give it the most likely genre for each song as a label. This is known as supervised learning. Unfortunately, genre labels are not provided with the freely available data. Because selecting the genre for each of the 57,000 songs individually would be a huge task, I assign each song its artist’s best-known genre, using multiple readily available sources that categorize music. I have done this for 632 artists.
Note: I have combined some genres to make the analysis work better due to similarities in genres.
There are now 19 genres in my data.
Read the 57,000-song dataset from an online .csv file: https://www.researchgate.net/publication/220723656_The_Million_Song_Dataset
Read my artist-to-genre mapping .csv file.
Join them together into one data frame.
Clean the data.
Prepare for the next steps by adding an id variable and dropping the URL link variable, which is not needed for analysis.
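The preparation steps above can be sketched roughly as follows. This is a minimal illustration on made-up rows, not the actual files: the two toy data frames stand in for the lyrics and artist-to-genre .csv inputs, the column name `link` for the URL variable is an assumption, and the use of an inner join and an empty-lyrics filter as the "cleaning" step is also assumed.

```r
library(dplyr)

# Toy stand-ins for the two .csv inputs (hypothetical rows and column names)
songs <- tibble::tibble(
  artist = c("ABBA", "ABBA", "Unknown Artist"),
  song   = c("Bang", "Cassandra", "Untitled"),
  link   = c("/a/1.html", "/a/2.html", "/u/3.html"),
  text   = c("Making somebody happy", "Down in the street", "La la la")
)
artist_genre <- tibble::tibble(artist = "ABBA", genre = "Pop")

raw_dat <- songs %>%
  inner_join(artist_genre, by = "artist") %>%  # keep only artists with a mapped genre
  filter(!is.na(text), text != "") %>%         # basic cleaning: drop empty lyrics
  mutate(id = row_number()) %>%                # add a song id
  select(-link)                                # drop the URL column

print(raw_dat)
```

With real data, the two toy tibbles would be replaced by `read.csv()` calls on the corresponding files.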
# head(raw_dat)
# kable(head(raw_dat))
#column_spec(5:7, bold = T) %>%
#row_spec(3:5, bold = T, color = "white", background = "#D7261E")
# table_output(raw_data)
Compute: Number of Words, Letters, and Average Word Length per Song.
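library(dplyr)    # %>%, mutate
library(syuzhet)  # get_sentiment
library(stringi)  # stri_count
# Per-song features: sentiment score, word count, character count, average word length
dat = raw_dat %>%
  mutate(sentiment = get_sentiment(text),
         number_of_words = stri_count(text, regex="\\S+"),
         number_of_letters = nchar(text),  # counts every character, including spaces and line breaks
         avg_word_length = number_of_letters / number_of_words)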
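To make the feature definitions concrete, here is a small worked example in base R on a made-up four-word line (not taken from the dataset). It suggests why the average word lengths reported for the songs look fairly high: `nchar()` counts whitespace as well as letters. The whitespace-free variant is shown only for comparison and is not part of the original analysis.

```r
text <- "Look at her face"  # illustrative string

number_of_words <- lengths(gregexpr("\\S+", text))  # runs of non-whitespace characters
number_of_chars <- nchar(text)                      # all characters, spaces included
letters_only    <- nchar(gsub("\\s", "", text))     # characters after removing whitespace

avg_with_spaces    <- number_of_chars / number_of_words
avg_without_spaces <- letters_only / number_of_words

cat(number_of_words, number_of_chars, letters_only,
    avg_with_spaces, avg_without_spaces, "\n")
```

Note that `gregexpr()` reports a single `-1` for an empty string, so this word count is only valid for non-empty text.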
table_output(raw_dat)
| artist | song | text | genre | id |
|---|---|---|---|---|
| ABBA | Ahe’s My Kind Of Girl | Look at he | Pop | 1 |
| ABBA | Andante, Andante | Take it ea | Pop | 2 |
| ABBA | As Good As New | I’ll never | Pop | 3 |
| ABBA | Bang | Making som | Pop | 4 |
| ABBA | Bang-A-Boomerang | Making som | Pop | 5 |
| ABBA | Burning My Bridges | Well, you | Pop | 6 |
| ABBA | Cassandra | Down in th | Pop | 7 |
| ABBA | Chiquitita | Chiquitita | Pop | 8 |
| ABBA | Crazy World | I was out | Pop | 9 |
| ABBA | Crying Over You | I’m waitin | Pop | 10 |
table_output(dat)
| artist | song | text | genre | id | sentiment | number_of_words | number_of_letters | avg_word_length |
|---|---|---|---|---|---|---|---|---|
| ABBA | Ahe’s My Kind Of Girl | Look at he | Pop | 1 | 3.95 | 153 | 761 | 4.973856 |
| ABBA | Andante, Andante | Take it ea | Pop | 2 | 5.90 | 260 | 1434 | 5.515385 |
| ABBA | As Good As New | I’ll never | Pop | 3 | 4.15 | 312 | 1477 | 4.733974 |
| ABBA | Bang | Making som | Pop | 4 | 7.95 | 200 | 1247 | 6.235000 |
| ABBA | Bang-A-Boomerang | Making som | Pop | 5 | 8.20 | 198 | 1263 | 6.378788 |
| ABBA | Burning My Bridges | Well, you | Pop | 6 | -2.10 | 109 | 570 | 5.229358 |
| ABBA | Cassandra | Down in th | Pop | 7 | -7.10 | 361 | 2028 | 5.617729 |
| ABBA | Chiquitita | Chiquitita | Pop | 8 | -2.50 | 304 | 1591 | 5.233553 |
| ABBA | Crazy World | I was out | Pop | 9 | -0.45 | 304 | 1533 | 5.042763 |
| ABBA | Crying Over You | I’m waitin | Pop | 10 | -0.25 | 123 | 637 | 5.178862 |
dat %>%
  filter(sentiment != 0) %>%
  group_by(genre) %>%
  summarize(mean_sentiment = mean(sentiment)) %>%
  ggplot(aes(x = fct_reorder(genre, mean_sentiment), y = mean_sentiment, fill=mean_sentiment)) +
  geom_col() +
  coord_flip() +
  labs(x = '', y = 'Sentiment', title = 'Lyric Sentiment by Genre') +
  theme(legend.position = 'none')
The genres with the most positive sentiment are Religious, R&B, and Blues.
The genres with the most negative sentiment are Hip-Hop-Rap and Metal (just slightly negative).
Jazz is close to neutral.
dat %>%
  ggplot(aes(x = sentiment, col = genre, fill = genre)) +
  geom_density(alpha = 0.5) +
  labs(x = 'Sentiment', y = 'Density', title = 'Distribution of Sentiment by Genre')
The distribution plot gives us a deeper look into each genre’s sentiment. All genres except Hip-Hop-Rap appear to have a roughly normal distribution of sentiment, but with different means and standard deviations. The Hip-Hop-Rap and Metal distributions sit furthest to the left (most negative).
Unnesting all 57,000 songs into words (millions of rows) exceeded R’s object capacity.
To work around this, I break the data into 4 pieces, unnest each piece into words, and stitch the pieces back together.
I then keep only the important words for each genre.
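The visual impression from the density plot could be backed up with per-genre summary statistics. The sketch below uses made-up sentiment scores for two hypothetical genres; the `skewness()` helper is a simple moment-based definition written for this illustration, not a function from the packages used in this report.

```r
library(dplyr)

# Moment-based sample skewness: negative values indicate a longer left tail
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Hypothetical sentiment scores for two genres
toy <- tibble::tibble(
  genre     = rep(c("A", "B"), each = 5),
  sentiment = c(-10, 1, 2, 3, 4,   # A: one strongly negative song
                 -2, -1, 0, 1, 2)  # B: symmetric around zero
)

sentiment_summary <- toy %>%
  group_by(genre) %>%
  summarize(mean_sentiment = mean(sentiment),
            sd_sentiment   = sd(sentiment),
            skew           = skewness(sentiment))

print(sentiment_summary)
```

Applied to `dat`, the same `group_by(genre)` summary would separate a distribution that is shifted toward negative values (lower mean) from one that has a long negative tail (negative skew).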
# Break text_dat_raw into 4 parts
text_dat_raw_1 = raw_dat[1:10000,]
text_dat_raw_2 = raw_dat[10001:20000,]
text_dat_raw_3 = raw_dat[20001:30000,]
text_dat_raw_4 = raw_dat[30001:nrow(raw_dat),]
# Unnest all 4 parts into one word per row (stop words are not removed at this stage)
text_dat_raw_1 = text_dat_raw_1 %>%
  unnest_tokens(word, text)
text_dat_raw_2 = text_dat_raw_2 %>%
  unnest_tokens(word, text)
text_dat_raw_3 = text_dat_raw_3 %>%
  unnest_tokens(word, text)
text_dat_raw_4 = text_dat_raw_4 %>%
  unnest_tokens(word, text)
# Combine 4 parts back into 1
text_dat_raw = text_dat_raw_1 %>%
  bind_rows(text_dat_raw_2) %>%
  bind_rows(text_dat_raw_3) %>%
  bind_rows(text_dat_raw_4)
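# Split at the song level: sample 75% of the unique song ids for training
all_ids = text_dat_raw %>% filter(!is.na(id)) %>% select(id) %>% unique()
train_ids = sample_frac(all_ids, 0.75)
# The remaining 25% of ids form the test set
test_ids = all_ids %>% anti_join(train_ids, by = "id")
# Word-level rows for the training songs only
text_dat_raw_train = train_ids %>% left_join(text_dat_raw, by = "id")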
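text_dat_grouped = text_dat_raw_train %>%
  # one row per (genre, song, word), so each song counts a word only once
  group_by(genre, id, word) %>%
  summarize(n = 1) %>%
  # number of songs in each genre that contain the word
  group_by(genre, word) %>%
  summarize(n = n()) %>%
  # top 50 words per genre by song count (ties included)
  group_by(genre) %>%
  top_n(50, n) %>%
  ungroup() %>%
  # keep only words that appear in exactly one genre's top list
  group_by(word) %>%
  mutate(wdcount = n()) %>%
  ungroup() %>%
  filter(wdcount == 1) %>%
  select(word)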
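The word-selection logic is easier to follow on a tiny example. The sketch below uses made-up tokens and a top-2 cutoff instead of top-50; it also swaps in `slice_max()` and `add_count()` for `top_n()` and the grouped `mutate()`, which is a stylistic substitution for illustration rather than the code used in this report.

```r
library(dplyr)

# Hypothetical tokenized lyrics: one row per (genre, song id, word)
toy <- tibble::tribble(
  ~genre,        ~id, ~word,
  "Religious",   1,   "lord",
  "Religious",   1,   "love",
  "Religious",   2,   "lord",
  "Religious",   2,   "love",
  "Hip-Hop-Rap", 3,   "yeah",
  "Hip-Hop-Rap", 3,   "love",
  "Hip-Hop-Rap", 4,   "yeah",
  "Hip-Hop-Rap", 4,   "love"
)

# Number of songs per genre that contain each word
doc_freq <- toy %>%
  distinct(genre, id, word) %>%
  count(genre, word, name = "n_songs")

# Most common words within each genre (top 2 here, ties kept)
top_words <- doc_freq %>%
  group_by(genre) %>%
  slice_max(n_songs, n = 2, with_ties = TRUE) %>%
  ungroup()

# Keep words that made the top list of exactly one genre
distinctive <- top_words %>%
  add_count(word, name = "n_genres") %>%
  filter(n_genres == 1) %>%
  pull(word)

print(sort(distinctive))
```

Here "love" is common in both genres and is dropped, while "lord" and "yeah" each survive as markers of a single genre.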
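# Build a song-by-word indicator table (1 if the song contains the word)
text_dat_spread = text_dat_grouped %>%
  left_join(text_dat_raw, by = 'word') %>%
  select(id, word) %>%
  distinct(id, word) %>%
  mutate(n = 1) %>%
  spread(key = word, value = n)
# Words absent from a song become 0
text_dat_spread[is.na(text_dat_spread)]=0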
final_dat = dat %>% select(-song) %>%
  left_join(text_dat_spread, by = 'id') %>%
  select(-text, -artist) %>%
  janitor::clean_names() %>%
  rename(Class = genre) %>%
  mutate(Class = as.factor(Class))
final_dat[is.na(final_dat)] = 0
table_output(final_dat)
| Class | id | sentiment | number_of_words | number_of_letters | avg_word_length | aint | as | back | chorus | god | ill | jesus | let | lord | name | praise | take | well | yeah |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pop | 1 | 3.95 | 153 | 761 | 4.973856 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Pop | 2 | 5.90 | 260 | 1434 | 5.515385 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| Pop | 3 | 4.15 | 312 | 1477 | 4.733974 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Pop | 4 | 7.95 | 200 | 1247 | 6.235000 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| Pop | 5 | 8.20 | 198 | 1263 | 6.378788 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| Pop | 6 | -2.10 | 109 | 570 | 5.229358 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| Pop | 7 | -7.10 | 361 | 2028 | 5.617729 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Pop | 8 | -2.50 | 304 | 1591 | 5.233553 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Pop | 9 | -0.45 | 304 | 1533 | 5.042763 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| Pop | 10 | -0.25 | 123 | 637 | 5.178862 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
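# Note: this row-level 75/25 split is superseded by the id-based split below,
# which overwrites dat_train
training_split = 0.75
smp_size = floor(training_split * nrow(final_dat))
dat_index = sample(seq_len(nrow(final_dat)), size = smp_size)
dat_train = as.data.frame(final_dat[dat_index,])
# Reuse the song ids that were sampled for word selection
dat_train = train_ids %>% left_join(final_dat, by = "id")
dat_test = test_ids %>% left_join(final_dat, by = "id")
# Balance the classes by upsampling the less common genres (caret::upSample)
dat_train = upSample(x = dat_train %>% select(-Class),
                     y = dat_train$Class)
dat_train = dat_train %>%
  select(-id)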
table_output(dat_train)
| sentiment | number_of_words | number_of_letters | avg_word_length | aint | as | back | chorus | god | ill | jesus | let | lord | name | praise | take | well | yeah | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| -1.70 | 134 | 787 | 5.873134 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | Blues |
| -0.15 | 96 | 506 | 5.270833 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | Blues |
| 0.45 | 85 | 492 | 5.788235 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Blues |
| 1.25 | 133 | 716 | 5.383459 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Blues |
| 3.05 | 119 | 681 | 5.722689 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Blues |
| 1.95 | 239 | 1210 | 5.062761 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Blues |
| 3.00 | 384 | 2232 | 5.812500 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | Blues |
| 1.35 | 162 | 853 | 5.265432 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Blues |
| 3.30 | 208 | 1088 | 5.230769 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | Blues |
| -0.20 | 163 | 895 | 5.490798 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | Blues |
dat_test = dat_test %>%
  select(-id)
table_output(dat_test)
| Class | sentiment | number_of_words | number_of_letters | avg_word_length | aint | as | back | chorus | god | ill | jesus | let | lord | name | praise | take | well | yeah |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pop | 0.00 | 345 | 1823 | 5.284058 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Pop | 0.65 | 368 | 2169 | 5.894022 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Pop | 2.05 | 237 | 1279 | 5.396624 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| Pop | 1.00 | 356 | 1892 | 5.314607 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| Pop | 1.00 | 248 | 1457 | 5.875000 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Pop | 5.90 | 206 | 1082 | 5.252427 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Pop | 5.85 | 383 | 2408 | 6.287206 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| Pop | 1.85 | 214 | 1208 | 5.644860 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Pop | -1.25 | 126 | 729 | 5.785714 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Pop | 1.85 | 202 | 1071 | 5.301980 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
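# 10-fold cross-validation (caret's default number of folds for method = "cv")
train_control = trainControl(method = "cv")
# Random forest via ranger: 50 trees, impurity-based variable importance
model_rf = train(dat_train %>% select(-Class),
                 dat_train$Class,
                 method = "ranger",
                 num.trees = 50,
                 importance = "impurity",
                 trControl = train_control)
# Predict genres for the held-out songs and compare with the true labels
predictions_rf = predict(model_rf, dat_test)
confusionMatrix(predictions_rf, as.factor(dat_test$Class))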
## Confusion Matrix and Statistics
##
## Reference
## Prediction Blues Country D-DJ-E-D-E-H Folk Hip-Hop-Rap Jazz Metal Pop
## Blues 4 10 2 10 1 1 1 32
## Country 14 86 10 52 4 17 4 170
## D-DJ-E-D-E-H 1 3 3 8 1 0 1 18
## Folk 10 49 6 49 1 14 9 140
## Hip-Hop-Rap 1 7 1 7 136 1 0 81
## Jazz 3 14 1 17 2 12 4 63
## Metal 1 4 0 5 1 2 6 25
## Pop 69 260 73 305 73 76 59 1542
## R&B 2 8 3 10 5 2 1 45
## Religious 3 20 1 9 3 3 0 46
## Rock 77 246 51 267 24 79 86 1030
## Reference
## Prediction R&B Religious Rock
## Blues 2 0 27
## Country 15 17 180
## D-DJ-E-D-E-H 0 1 21
## Folk 12 8 152
## Hip-Hop-Rap 17 1 35
## Jazz 6 4 47
## Metal 3 0 26
## Pop 110 38 1083
## R&B 9 1 37
## Religious 2 129 34
## Rock 56 32 1492
##
## Overall Statistics
##
## Accuracy : 0.377
## 95% CI : (0.367, 0.387)
## No Information Rate : 0.347
## P-Value [Acc > NIR] : 1.004e-09
##
## Kappa : 0.1372
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Blues Class: Country Class: D-DJ-E-D-E-H
## Sensitivity 0.0216216 0.121641 0.0198675
## Specificity 0.9904603 0.943130 0.9940325
## Pos Pred Value 0.0444444 0.151142 0.0526316
## Neg Pred Value 0.9801317 0.928050 0.9838128
## Prevalence 0.0201087 0.076848 0.0164130
## Detection Rate 0.0004348 0.009348 0.0003261
## Detection Prevalence 0.0097826 0.061848 0.0061957
## Balanced Accuracy 0.5060410 0.532385 0.5069500
## Class: Folk Class: Hip-Hop-Rap Class: Jazz
## Sensitivity 0.066306 0.54183 0.057971
## Specificity 0.952606 0.98313 0.982097
## Pos Pred Value 0.108889 0.47387 0.069364
## Neg Pred Value 0.921143 0.98710 0.978398
## Prevalence 0.080326 0.02728 0.022500
## Detection Rate 0.005326 0.01478 0.001304
## Detection Prevalence 0.048913 0.03120 0.018804
## Balanced Accuracy 0.509456 0.76248 0.520034
## Class: Metal Class: Pop Class: R&B Class: Religious
## Sensitivity 0.0350877 0.4831 0.0387931 0.55844
## Specificity 0.9925795 0.6428 0.9872881 0.98651
## Pos Pred Value 0.0821918 0.4181 0.0731707 0.51600
## Neg Pred Value 0.9819218 0.7007 0.9754324 0.98860
## Prevalence 0.0185870 0.3470 0.0252174 0.02511
## Detection Rate 0.0006522 0.1676 0.0009783 0.01402
## Detection Prevalence 0.0079348 0.4009 0.0133696 0.02717
## Balanced Accuracy 0.5138336 0.5629 0.5130406 0.77248
## Class: Rock
## Sensitivity 0.4761
## Specificity 0.6789
## Pos Pred Value 0.4337
## Neg Pred Value 0.7149
## Prevalence 0.3407
## Detection Rate 0.1622
## Detection Prevalence 0.3739
## Balanced Accuracy 0.5775
model_rf$finalModel %>%
  # extract variable importance metrics
  ranger::importance() %>%
  # convert to a data frame
  enframe(name = "variable", value = "varimp") %>%
  top_n(n = 20, wt = varimp) %>%
  # plot the metrics
  ggplot(aes(x = fct_reorder(variable, varimp), y = varimp, fill=varimp)) +
  geom_col() +
  coord_flip() +
  labs(x = "Token",
       y = "Variable importance (higher is more important)")
a = dat_train %>% as_tibble()
model_rf_data_structure = a[0,]
timestamp = Sys.time()
saveRDS(model_rf_data_structure, paste0('models/model_rf_data_structure - ', timestamp, '.rds'))
saveRDS(model_rf, file = paste0('models/model_rf - ', timestamp, '.rds'))
saveRDS(model_rf_data_structure, paste0('models/model_rf_data_structure', '.rds'))
saveRDS(model_rf, file = paste0('models/model_rf', '.rds'))
set.seed(1234)
# Count words (stop words removed) within five selected genres
word_count = text_dat_raw %>%
  filter(genre %in% c("Rock", "Hip-Hop-Rap", "Religious", "Country", "Pop")) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, genre, sort = TRUE) %>%
  mutate(genre = as.character(genre))
# Reshape to a word-by-genre matrix and draw the comparison cloud
word_count %>%
  reshape2::acast(word ~ genre, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = brewer.pal(5, "Dark2"),  # one color per genre; "Dark2" has at most 8 colors
                   max.words = 200,
                   title.size = 2,
                   scale = c(1.5, 0.2))  # c(largest, smallest) word size
title(main = "Word Cloud from Lyrics")
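To a reasonable degree, lyrics alone can be used to predict a song’s genre. I achieved a 38% prediction accuracy across the 11 genres in my statistical prediction model. This is better than pure guessing (which would be closer to 5% for 19 genres in the data), and modestly better than the 34.7% no-information rate reported with the confusion matrix, which is the accuracy obtained by always predicting the most common genre. Some genres can be predicted better than others.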
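The baselines used for comparison can be computed directly from a confusion matrix. The sketch below uses a small made-up 3-class matrix (not the results of this project) to show how overall accuracy, the no-information rate, and the uniform-guess rate are derived.

```r
# Toy confusion matrix: rows are predictions, columns are reference (true) classes
cm <- matrix(c(50, 10, 5,
                8, 20, 2,
                2,  0, 3),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = c("A", "B", "C"),
                             Reference  = c("A", "B", "C")))

accuracy            <- sum(diag(cm)) / sum(cm)      # share of songs on the diagonal
no_information_rate <- max(colSums(cm)) / sum(cm)   # always predict the largest true class
uniform_guess       <- 1 / ncol(cm)                 # pick one of the classes at random

cat(accuracy, no_information_rate, uniform_guess, "\n")
```

In the confusion matrix output for this project, the corresponding reported values are an accuracy of 0.377 and a no-information rate of 0.347.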
Here are the results of the top 11 genres:
Engineered variables, including number of words, number of letters, and sentiment, led the list of importance in the prediction model, while down, like, ain’t, baby, and love were the most important words.
Most genres have a positive sentiment, with the exception of Hip-Hop-Rap and Metal. More precise analysis could further differentiate sentiment into categories such as foul language, cultural meanings, and others. For example, Reggae music commonly uses words such as ra and wit.
Random Forest is a good prediction model for this Natural Language Processing project. Further tuning might raise the accuracy, but considerable compute resources are needed to run this model multiple times to converge on an optimal tuning.
My secondary Neural Networks model produced mixed results. The overall accuracy was only 10%, but it achieved higher accuracies than Random Forest for some genres. Because of the low overall accuracy, my report focuses on my Random Forest model and results. I did use a Random Forest pre-processing method for the Neural Networks model, showing that the two can work together.
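One way further tuning could be set up is with a small search grid passed to caret. The sketch below is illustrative only: it runs on the built-in `iris` data rather than the lyrics features, and the grid values and the 3-fold cross-validation are arbitrary assumptions chosen to keep the run short. It requires the caret and ranger packages.

```r
library(caret)

set.seed(1234)

# Candidate values for the three parameters caret tunes for method = "ranger"
tune_grid <- expand.grid(mtry          = c(1, 2),
                         splitrule     = "gini",
                         min.node.size = c(1, 5))

# Small example on iris; each grid row is evaluated with 3-fold cross-validation
model_tuned <- train(Species ~ .,
                     data      = iris,
                     method    = "ranger",
                     num.trees = 50,
                     tuneGrid  = tune_grid,
                     trControl = trainControl(method = "cv", number = 3))

print(model_tuned$bestTune)
```

Every grid row multiplies the number of model fits by the number of folds, which is why a search like this becomes expensive on a dataset of this size.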
Dealing with Big Data in this project was a challenge, but taught me a great deal about several techniques to help with both the data preparation and model running stages. With Big Data there is no substitute for more compute power and memory. But, even with those, limitations in the R language required techniques to work around them.
Combining my lyrics prediction analysis with existing music analysis would likely produce an even higher prediction accuracy. This will be a personal project for me after I leave this certification training.
On a personal note, this project has sparked my interest in seeking Data Science work that builds on my previous experience.