# install.packages("DataCombine")  # run once if the package is not installed
library('tidyverse')
## ── Attaching packages ───────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.0 ✔ purrr 0.3.2
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library('tidytext')
library('tidyr')
library('scales')
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library('wordcloud')
## Loading required package: RColorBrewer
library('reshape2')
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library('textdata')
library('DataCombine')
data("stop_words")
A data science analysis by Francesco Bonifazi.

This project proposes, and seeks to show, that song lyrics can be a determining factor of a song’s genre. We do not discount the power of harmony, melody, and rhythm to differentiate genre, but it is not common for musicians, educators, and the like to classify music by its lyrics.

# GOAL:

The goal of this project is to use data science to analyze a library of popular song lyrics and to build and train a model that identifies the words that best characterize a song’s genre. Ultimately, I plan to test the model on song lyrics that are not in the sample database, including ones from major artists and beyond.

#### Action Plan:

I will use the readily available subset of lyrics from the Million Song library. My sample size will be about 57,000 songs across multiple genres and artists. I will split this into training and test data, starting with 0.75 for training. To train the prediction, I need the most likely genre for each song. Unfortunately, this is not provided with the freely available data. Because selecting a genre for each of the 57,000 songs is a huge task, I will assign a genre per artist, using multiple sources that categorize music. I will also list a “sub-genre” in case it proves more effective to use.

# Business Case:

Why would anyone or any company care whether lyrics define a song’s genre? Let’s look at potential company use. Businesses such as Pandora and Spotify have been analyzing and classifying music (including songs with lyrics) to provide their customers with a “premium” listening experience. The common thought seems to be that most listeners don’t enjoy a wide range of music, so keeping them on the service relies in part on providing recommendations, and even “next-up” songs. While these algorithms are valuable corporate secrets, to my knowledge none have focused on lyrics exclusively.

It would benefit these music service providers to complement their own AI/ML solutions with another one focused on lyrics. This would offer both confirmation of their music analysis and new insights.

Next, let’s look at songwriters, their management, and licensing companies such as ASCAP and BMI. Among songwriters, there is a sub-group who are only lyricists, meaning someone else composes the music. Composers look for lyrics to compose to. Lyricists look for composers who are known for specific genres. Many haven’t worked together before. Managers often have knowledge of both sides and will broker deals that combine their talents. Giving songwriters and composers the ability to look for each other by matching lyrics with music could further democratize the music industry.

Licensing companies deal with both lyrics and songs as legal assets they manage for artists. While this is not the main part of their business, they would benefit from validating a lyric’s genre against what the artist thinks it is. When another artist wants to record someone else’s song, they have to pay license fees to ASCAP, for example. ASCAP has a website that handles the financial side of this, but not searching for and selecting songs. Having this capability would enhance their service offering to clients, be a competitive differentiator, and potentially increase profits through “premium” services.

High-Level Summary of Results: to be written later.
Read the 57,000-song dataset from a .csv file online, read my artist-to-genre .csv file online, and join them together into one data frame named “songdata”.
raw_song <- read_csv("https://foco-ds-portal-files.s3.amazonaws.com/songdata.csv")
## Parsed with column specification:
## cols(
## artist = col_character(),
## song = col_character(),
## link = col_character(),
## text = col_character()
## )
raw_genre = read_csv("https://foco-ds-portal-files.s3.amazonaws.com/Artists_Genre_Mapping.csv")
## Parsed with column specification:
## cols(
## Band = col_character(),
## `Genre Updated` = col_character()
## )
songdata = raw_song %>%
left_join(raw_genre, by = c('artist' = 'Band'))
class(songdata)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
The result is already a tibble, with readr’s “spec_tbl_df” class on top. Try converting it with as_tibble():
songdata_tibble <- as_tibble(songdata)
class(songdata_tibble)
## [1] "tbl_df" "tbl" "data.frame"
The extra “spec_tbl_df” class is gone; otherwise it looks the same. Count the rows:
songdata_tibble %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 57668
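A `left_join()` returns one row per match, so a band listed more than once in the mapping file would repeat that band's songs in the joined table. The sketch below shows one way to check for that on toy data. `toy_songs` and `toy_genres` are hypothetical stand-ins for `raw_song` and `raw_genre`, not the real files, and nothing here shows that the real mapping contains duplicates.

```r
library(dplyr)

# Hypothetical stand-ins for raw_song and raw_genre
toy_songs <- tibble(
  artist = c("ABBA", "ABBA", "Adele"),
  song   = c("Bang", "Cassandra", "Hello")
)
toy_genres <- tibble(
  Band = c("ABBA", "ABBA", "Adele"),        # "ABBA" is listed twice on purpose
  `Genre Updated` = c("Pop", "Pop", "Pop")
)

# Bands that appear more than once in the mapping
duplicate_bands <- toy_genres %>%
  count(Band) %>%
  filter(n > 1)

# Joining on a duplicated key repeats the matching song rows
# (recent dplyr versions may warn about the many-to-many match)
toy_joined <- toy_songs %>%
  left_join(toy_genres, by = c("artist" = "Band"))

nrow(toy_songs)   # 3 songs before the join
nrow(toy_joined)  # 5 rows after the join: each ABBA song matched twice

# Keeping one mapping row per band restores one row per song
toy_joined_unique <- toy_songs %>%
  left_join(distinct(toy_genres, Band, .keep_all = TRUE),
            by = c("artist" = "Band"))
```

Comparing `nrow()` before and after the join is a quick way to notice this kind of fan-out.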
colnames(songdata_tibble)
## [1] "artist" "song" "link" "text"
## [5] "Genre Updated"
songdata_tibble_sm = rename(songdata_tibble, "Genre" = "Genre Updated") %>%
filter(!is.na(Genre)) %>%
select(artist, song, text, Genre) %>%
tibble::rowid_to_column("ID")
colnames(songdata_tibble_sm)
## [1] "ID" "artist" "song" "text" "Genre"
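The row count above shows 57,668 rows in the joined table, where each row is a song. The filtered table songdata_tibble_sm keeps only the rows that have a genre, so it may hold fewer.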
songdata_tibble_sm %>%
select(artist) %>%
unique()
## # A tibble: 359 x 1
## artist
## <chr>
## 1 ABBA
## 2 Ace Of Base
## 3 Adam Sandler
## 4 Adele
## 5 Aerosmith
## 6 Air Supply
## 7 Aiza Seguerra
## 8 Alabama
## 9 Alice Cooper
## 10 Alice In Chains
## # … with 349 more rows
There are 359 artists left in this dataset after dropping songs without a genre. If I hold out 30% of artists for test data, that is about 108 artists.
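If the test set is meant to hold out whole artists rather than individual songs, the sample can be drawn from the list of distinct artists. This is a minimal sketch on toy data, assuming a 30% hold-out. `toy_catalog`, `held_out_artists`, `artist_test`, and `artist_train` are hypothetical names, not objects from this analysis.

```r
library(dplyr)

set.seed(123)

# Hypothetical catalog: 10 artists with 2 songs each
toy_catalog <- tibble(
  artist = rep(paste("Artist", 1:10), each = 2),
  song   = paste("Song", 1:20)
)

# Draw 30% of the distinct artists for the test set
held_out_artists <- toy_catalog %>%
  distinct(artist) %>%
  sample_frac(0.30)

# Songs by held-out artists go to test; all other songs go to train
artist_test  <- toy_catalog %>% semi_join(held_out_artists, by = "artist")
artist_train <- toy_catalog %>% anti_join(held_out_artists, by = "artist")
```

Splitting by artist keeps every song by a given artist on one side of the split, which matters here because genre is assigned per artist.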
songdata_tibble_sm %>%
select(Genre) %>%
unique()
## # A tibble: 29 x 1
## Genre
## <chr>
## 1 Pop
## 2 Comedy
## 3 Rock
## 4 Americana
## 5 Religious
## 6 Musicals
## 7 Folk
## 8 Soundtrack
## 9 Soul
## 10 Jazz
## # … with 19 more rows
There are 29 Genres. Rows with a missing Genre were already dropped by the filter(!is.na(Genre)) step above, so no NA genre should remain.
head (songdata_tibble_sm)
## # A tibble: 6 x 5
## ID artist song text Genre
## <int> <chr> <chr> <chr> <chr>
## 1 1 ABBA Ahe's My Kind… "Look at her face, it's a wonderful fa… Pop
## 2 2 ABBA Andante, Anda… "Take it easy with me, please \nTouch… Pop
## 3 3 ABBA As Good As New "I'll never know why I had to go \nWh… Pop
## 4 4 ABBA Bang "Making somebody happy is a question o… Pop
## 5 5 ABBA Bang-A-Boomer… "Making somebody happy is a question o… Pop
## 6 6 ABBA Burning My Br… "Well, you hoot and you holler and you… Pop
How many ABBA songs?
songdata_tibble_sm %>%
filter (artist == "ABBA") %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 113
113 ABBA Songs!
Break up the song lyrics (text) into words for ABBA
songdata_tibble_sm %>%
filter (artist == "ABBA") %>%
unnest_tokens(word, text)
## # A tibble: 28,724 x 5
## ID artist song Genre word
## <int> <chr> <chr> <chr> <chr>
## 1 1 ABBA Ahe's My Kind Of Girl Pop look
## 2 1 ABBA Ahe's My Kind Of Girl Pop at
## 3 1 ABBA Ahe's My Kind Of Girl Pop her
## 4 1 ABBA Ahe's My Kind Of Girl Pop face
## 5 1 ABBA Ahe's My Kind Of Girl Pop it's
## 6 1 ABBA Ahe's My Kind Of Girl Pop a
## 7 1 ABBA Ahe's My Kind Of Girl Pop wonderful
## 8 1 ABBA Ahe's My Kind Of Girl Pop face
## 9 1 ABBA Ahe's My Kind Of Girl Pop and
## 10 1 ABBA Ahe's My Kind Of Girl Pop it
## # … with 28,714 more rows
28,724 words in all the ABBA songs! But there are repeats and throw-away words (“a”, “i”, etc.) in this list.
Search for the number of times “girl” is used. Use tidytext’s unnest_tokens function to arrange one word per row.
songdata_tibble_sm %>%
filter (artist == "ABBA") %>%
group_by(song) %>%
unnest_tokens(word, text) %>%
filter(word == "girl") %>%
count()
## # A tibble: 2 x 2
## # Groups: song [2]
## song n
## <chr> <int>
## 1 I Have A Dream 1
## 2 Man In The Middle 1
Only 2 ABBA songs have the word “girl” in them? That seems low out of 113 ABBA songs, and the query missed their first song, “Ahe’s My Kind Of Girl”, which I checked has “girl” 4 times. I have a lot to learn about text parsing…
Next, let’s get rid of words such as “the” and “a”, called “stop_words”. Running this on all rows seemed to choke, so I am doing it for ABBA only to see the results.
songdata_tibble_sm %>%
filter (artist == "ABBA") %>%
unnest_tokens(word, text) %>%
anti_join(stop_words) %>%
count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 1,917 x 2
## word n
## <chr> <int>
## 1 la 376
## 2 love 187
## 3 gonna 86
## 4 time 81
## 5 feel 80
## 6 life 79
## 7 baby 69
## 8 day 68
## 9 girl 68
## 10 ah 67
## # … with 1,907 more rows
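After removing stop words, 1,917 unique words remain across all ABBA songs in this catalog. The words appear to be used at least twice, although this output alone does not show whether any word is confined to a single song. This shows “girl” used 68 times, far more than the grouped query above suggested.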
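The earlier grouped query found “girl” in only two songs, while this count gives 68 occurrences. One way to see which songs contribute is to tokenize the ungrouped table first and then count by song. This is a sketch on toy lyrics, with `toy_lyrics` and `girl_per_song` as hypothetical names. It does not explain why the grouped query returned so few rows; it only avoids grouping before tokenizing.

```r
library(dplyr)
library(tidytext)

# Hypothetical lyrics, one row per song
toy_lyrics <- tibble(
  song = c("Song A", "Song B", "Song C"),
  text = c("Girl, oh girl", "No one here", "That girl is gone")
)

girl_per_song <- toy_lyrics %>%
  unnest_tokens(word, text) %>%  # one lowercased word per row, punctuation dropped
  filter(word == "girl") %>%
  count(song, sort = TRUE)       # occurrences of "girl" in each song

girl_per_song
```

Summing the `n` column of the per-song table should reproduce the overall count for the word.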
songdata_tibble_sm %>%
filter (artist == "ABBA") %>%
unnest_tokens(word, text) %>%
anti_join(stop_words) %>%
count(word, sort = TRUE) %>%
mutate(word = reorder(word,n)) %>%
filter(n > 50) %>%
ggplot(aes(word, n)) +
geom_col() +
xlab("Words in song") +
coord_flip()
## Joining, by = "word"
The distribution is interesting: “la” and “love” are the most used words, and many words are used more than 50 times.
This shows “girl” is used by ABBA over 50 times! I want to cut this song list down for training data, and will increase the size if I don’t find interesting trends. I want to use a random selection since this is an alphabetical list.
set.seed(123)
training_split = 0.75 #75% data for train, 25% for test
sampled_fraction = 0.10 #Down size while building analysis
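# Draw 7.5% of all songs (75% of a 10% working sample) as training data
train_dat = songdata_tibble_sm %>% sample_frac(training_split * sampled_fraction)
# Everything not drawn for training; join on the unique ID, since song titles can repeat across artists
test_dat = songdata_tibble_sm %>% anti_join(train_dat, by = 'ID')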
head(train_dat)
## # A tibble: 6 x 5
## ID artist song text Genre
## <int> <chr> <chr> <chr> <chr>
## 1 2986 Cliff Rich… And I Love Her "I give her all my love \nT… Pop
## 2 29925 Keith Urban Sweet Thing "When I picked you up for ou… Mod-Co…
## 3 29710 Justin Tim… Like I Love You "Just something about you \… Pop
## 4 37529 Uriah Heep Fools "I know this feeling inside … Rock
## 5 2757 Christina … Over The River A… "Over the river \nAnd throu… Pop
## 6 9642 Leonard Co… Show Me The Place "Show me the place, where yo… Folk
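In the chunk above, `train_dat` is 7.5% of all songs and `test_dat` is taken from everything else, so the two are not in a 75/25 ratio. If the intent is a 75/25 split of a 10% working sample, one way to do it is to draw the working sample first and split it afterwards. This is a sketch on toy data; `toy_all`, `working_sample`, `split_train`, and `split_test` are hypothetical names.

```r
library(dplyr)

set.seed(123)

# Hypothetical table with a unique ID per song
toy_all <- tibble(ID = 1:1000, song = paste("Song", 1:1000))

# Step 1: down-size to a 10% working sample (100 rows)
working_sample <- toy_all %>% sample_frac(0.10)

# Step 2: 75% of the working sample for training (75 rows)
split_train <- working_sample %>% sample_frac(0.75)

# Step 3: the rest of the working sample for testing (25 rows), matched on ID
split_test <- working_sample %>% anti_join(split_train, by = "ID")
```

Raising the 0.10 later grows both sets together and keeps the 75/25 ratio.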
Let’s take a look at the lyrics in the training data… Starting with Dolly Parton:
train_dat %>%
group_by(artist) %>%
count(artist, sort = TRUE)
## # A tibble: 344 x 2
## # Groups: artist [344]
## artist n
## <chr> <int>
## 1 Air Supply 21
## 2 Kiss 21
## 3 Michael Jackson 21
## 4 Cher 20
## 5 Hank Snow 19
## 6 Moody Blues 19
## 7 Beautiful South 18
## 8 Dolly Parton 18
## 9 Green Day 18
## 10 John Denver 18
## # … with 334 more rows
train_dat_unnest = train_dat %>%
filter (artist == "Dolly Parton") %>%
unnest_tokens(word, text) %>%
anti_join(stop_words) %>%
count(word, sort = TRUE) %>%
mutate(word = reorder(word,n)) %>%
filter(n > 1)
## Joining, by = "word"
train_dat_unnest %>%
filter(n > 5) %>%
ggplot(aes(word, n)) +
geom_col() +
xlab("Words in Dolly Parton's songs") +
coord_flip()
Dolly’s words are somewhat different. “girl” is not among them, which makes sense: she is a female singer, and same-sex love songs are not common in the Country genre.
- I need a function that takes all of an artist’s songs and finds the most used words.
Let’s look at the words for all the songs in train_dat using the AFINN sentiment lexicon.
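A function along those lines could wrap the same steps used for ABBA and Dolly Parton. This is a minimal sketch: `top_words()` and its `min_n` argument are hypothetical, and it assumes the input has `artist` and `text` columns like `train_dat`.

```r
library(dplyr)
library(tidytext)
data("stop_words")

# Hypothetical helper: most used non-stop words for one artist
top_words <- function(songs, artist_name, min_n = 1) {
  songs %>%
    filter(artist == artist_name) %>%
    unnest_tokens(word, text) %>%          # one word per row
    anti_join(stop_words, by = "word") %>% # drop "the", "and", etc.
    count(word, sort = TRUE) %>%           # most frequent first
    filter(n >= min_n)
}

# Toy data for illustration only
toy_songs <- tibble(
  artist = c("A", "A", "B"),
  text   = c("love love guitar", "the guitar and love", "rain rain")
)

top_a <- top_words(toy_songs, "A")
top_a
```

On the real data the call would look like `top_words(train_dat, "Dolly Parton", min_n = 5)`, and the result can be piped into the same `ggplot()` code used above.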
afinn = get_sentiments("afinn")
#bing = get_sentiments("bing")
#loughran = get_sentiments("loughran")
head(train_dat_unnest)
## # A tibble: 6 x 2
## word n
## <fct> <int>
## 1 love 39
## 2 crazy 22
## 3 drive 22
## 4 downtown 21
## 5 cry 18
## 6 family 18
Unnest all of train_dat and remove stop_words
train_dat_unnest = train_dat %>%
group_by(artist) %>%
unnest_tokens(word, text) %>%
anti_join(stop_words)
## Joining, by = "word"
head(train_dat_unnest)
## # A tibble: 6 x 5
## # Groups: artist [1]
## ID artist song Genre word
## <int> <chr> <chr> <chr> <chr>
## 1 2986 Cliff Richard And I Love Her Pop love
## 2 2986 Cliff Richard And I Love Her Pop love
## 3 2986 Cliff Richard And I Love Her Pop love
## 4 2986 Cliff Richard And I Love Her Pop love
## 5 2986 Cliff Richard And I Love Her Pop tenderly
## 6 2986 Cliff Richard And I Love Her Pop kiss
sent_dat = train_dat_unnest %>%
left_join(afinn, by = 'word') %>%
rename(sentiment = value)
sent_dat
## # A tibble: 208,857 x 6
## # Groups: artist [344]
## ID artist song Genre word sentiment
## <int> <chr> <chr> <chr> <chr> <dbl>
## 1 2986 Cliff Richard And I Love Her Pop love 3
## 2 2986 Cliff Richard And I Love Her Pop love 3
## 3 2986 Cliff Richard And I Love Her Pop love 3
## 4 2986 Cliff Richard And I Love Her Pop love 3
## 5 2986 Cliff Richard And I Love Her Pop tenderly NA
## 6 2986 Cliff Richard And I Love Her Pop kiss 2
## 7 2986 Cliff Richard And I Love Her Pop lover NA
## 8 2986 Cliff Richard And I Love Her Pop brings NA
## 9 2986 Cliff Richard And I Love Her Pop brings NA
## 10 2986 Cliff Richard And I Love Her Pop love 3
## # … with 208,847 more rows
Many NAs… this will wreak havoc on my analysis!
I need to add words to the afinn list:
- baby, call, music, row, hold, em, songs, dolly, hear, life, night, parton, time, walk, watched, wrote, abraham, anymore, believes, birmingham, boulder, change, comin, color, day, door, eternity, feel, fits, goin, gonna, guitar, half, hummin, nashville, ol, rca, rock, saving, soul, standing, strummin, till, torch, wait, wash, world = 0
- blue, blues = -2
- heaven’s = 3
- heart, star = 2
- burn = 0 (could burn good or bad for you, baby!)
- burning = 3
- smart = 2
- bosom, honey, touch = 2
- fire = -2? “Come on light my fire” is good. “I’m going to fire you” is bad…
TEMPORARILY drop rows with no Genre from sent_dat and convert missing sentiment values to 0
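Before adding words by hand, it may help to measure how much of the text the lexicon misses and which unscored words occur most often. The sketch below uses toy data; `toy_tokens` and `toy_lexicon` are hypothetical stand-ins for `train_dat_unnest` and `afinn`, not the real tables.

```r
library(dplyr)

# Hypothetical tokens and a tiny stand-in lexicon (not the real AFINN list)
toy_tokens  <- tibble(word = c("love", "love", "tenderly", "kiss", "brings", "brings"))
toy_lexicon <- tibble(word = c("love", "kiss"), value = c(3, 2))

toy_scored <- toy_tokens %>%
  left_join(toy_lexicon, by = "word")   # unmatched words get value = NA

# Share of tokens with no lexicon entry
missing_share <- mean(is.na(toy_scored$value))

# Most frequent unscored words: candidates for manual scoring
unscored_words <- toy_scored %>%
  filter(is.na(value)) %>%
  count(word, sort = TRUE)

missing_share
unscored_words
```

Sorting the unscored words by frequency puts the words that affect the most tokens at the top of the list to score.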
sent_dat_noNAs = sent_dat %>%
group_by(Genre, sentiment) %>%
filter(!is.na(Genre))
sent_dat_noNAs$sentiment[is.na(sent_dat_noNAs$sentiment)] = 0
sent_dat_noNAs
## # A tibble: 208,857 x 6
## # Groups: Genre, sentiment [242]
## ID artist song Genre word sentiment
## <int> <chr> <chr> <chr> <chr> <dbl>
## 1 2986 Cliff Richard And I Love Her Pop love 3
## 2 2986 Cliff Richard And I Love Her Pop love 3
## 3 2986 Cliff Richard And I Love Her Pop love 3
## 4 2986 Cliff Richard And I Love Her Pop love 3
## 5 2986 Cliff Richard And I Love Her Pop tenderly 0
## 6 2986 Cliff Richard And I Love Her Pop kiss 2
## 7 2986 Cliff Richard And I Love Her Pop lover 0
## 8 2986 Cliff Richard And I Love Her Pop brings 0
## 9 2986 Cliff Richard And I Love Her Pop brings 0
## 10 2986 Cliff Richard And I Love Her Pop love 3
## # … with 208,847 more rows
Plot the distribution of sentiment scores for each Genre, leaving out the zeros
sent_dat_noNAs %>%
filter(sentiment != 0) %>%
group_by(Genre) %>%
select(sentiment) %>%
ggplot(aes(x = sentiment)) +
geom_histogram(bins = 10) +
facet_wrap(~Genre, scales = 'free')
## Adding missing grouping variables: `Genre`
Note the “holes” at 0: zero scores were filtered out before plotting.
sent_dat_noNAs %>%
filter(sentiment != 0) %>%
group_by(Genre) %>%
summarize(mean_sentiment = mean(sentiment)) %>%
ggplot(aes(x = Genre, y = mean_sentiment)) +
geom_col() +
coord_flip()
Most Genres have a positive mean sentiment, but Rap, Hip-Hop, Electronica, DJBeats, Dance, Comedy, and Americana are negative. Rock and Country are slightly negative (essentially neutral).
I need to add words to the afinn table to enrich its vocabulary for these songs.
afinn_row = c("frankieb", 0)
head(afinn_row)
## [1] "frankieb" "0"
Make a dataframe with these words = 0
baby, call, music, row, hold, em, songs, dolly, hear, life, night, parton, time, walk, watched, wrote, abraham, anymore, believes, birmingham, boulder, change, comin, color, day, door, eternity, feel, fits, goin, gonna, guitar, half, hummin, nashville, ol, rca, rock, saving, soul, standing, strummin, till, torch, wait, wash, world
Make a vector with all the words to insert into afinn later
afinn_add_data = c("baby", "call", "music", "row", "hold", "em", "songs", "dolly", "hear", "life", "night", "parton", "time", "walk", "watched", "wrote", "abraham", "anymore", "believes", "birmingham", "boulder", "change", "comin", "color", "day", "door", "eternity", "feel", "fits", "goin", "gonna", "guitar", "half", "hummin", "nashville", "ol", "rca", "rock", "saving", "soul", "standing", "strummin", "till", "torch", "wait", "wash", "world")
afinn_add_data
## [1] "baby" "call" "music" "row" "hold"
## [6] "em" "songs" "dolly" "hear" "life"
## [11] "night" "parton" "time" "walk" "watched"
## [16] "wrote" "abraham" "anymore" "believes" "birmingham"
## [21] "boulder" "change" "comin" "color" "day"
## [26] "door" "eternity" "feel" "fits" "goin"
## [31] "gonna" "guitar" "half" "hummin" "nashville"
## [36] "ol" "rca" "rock" "saving" "soul"
## [41] "standing" "strummin" "till" "torch" "wait"
## [46] "wash" "world"
Add the neutral-sentiment words, each with value 0, into my variable afinnPlus
afinnPlus = afinn
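# Insert each neutral word at the top of afinnPlus with a score of 0.
# Note: c(word, 0) is a character vector, so the 0 is stored as "0".
for (i in seq_along(afinn_add_data)) {
  afinn_row = c(afinn_add_data[i], 0)
  afinnPlus = InsertRow(afinnPlus, NewRow = afinn_row, RowNum=1)
}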
afinnPlus
## # A tibble: 2,524 x 2
## word value
## * <chr> <chr>
## 1 world 0
## 2 wash 0
## 3 wait 0
## 4 torch 0
## 5 till 0
## 6 strummin 0
## 7 standing 0
## 8 soul 0
## 9 saving 0
## 10 rock 0
## # … with 2,514 more rows
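The `value` column above prints as `<chr>` because `c(word, 0)` mixes a string and a number, and R coerces the number to text. Numeric summaries such as `mean()` would then not work on the scores without converting them back. One alternative, sketched here on toy data, is to build the new rows as a tibble and stack them with `bind_rows()`. `toy_afinn`, `neutral_words`, `extra_scores`, and `toy_afinn_plus` are hypothetical names, and `toy_afinn` is a two-row stand-in, not the real lexicon.

```r
library(dplyr)

# Tiny stand-in for the afinn table (word, numeric value)
toy_afinn <- tibble(word = c("abandon", "love"), value = c(-2, 3))

# New rows built as tibbles, so value stays numeric
neutral_words <- c("baby", "call", "music")
extra_scores <- bind_rows(
  tibble(word = neutral_words, value = 0),
  tibble(word = c("blue", "blues"), value = -2)
)

# Stack the additions on top of the existing lexicon
toy_afinn_plus <- bind_rows(extra_scores, toy_afinn)

toy_afinn_plus
```

This also replaces the row-by-row loop with a single call, and the same pattern covers the non-zero scores.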
Now add the remaining words to afinn, most with non-zero values:
- blue, blues = -2
- heaven’s = 3
- heart, star = 2
- burn = 0 (could burn good or bad for you, baby!)
- burning = 3
- smart = 2
- bosom, honey, touch = 2
afinn_row = c("blue", -2)
afinnPlus = InsertRow(afinnPlus, NewRow = afinn_row, RowNum=1)
afinn_row = c("blues", -2)
afinnPlus = InsertRow(afinnPlus, NewRow = afinn_row, RowNum=1)
head(afinnPlus)
## # A tibble: 6 x 2
## word value
## <chr> <chr>
## 1 blues -2
## 2 blue -2
## 3 world 0
## 4 wash 0
## 5 wait 0
## 6 torch 0
afinn_row = c("heaven's", 3)
afinnPlus = InsertRow(afinnPlus, NewRow = afinn_row, RowNum=1)
afinn_row = c("heart", 2)
afinnPlus = InsertRow(afinnPlus, NewRow = afinn_row, RowNum=1)
afinn_row = c("star", 2)
afinnPlus = InsertRow(afinnPlus, NewRow = afinn_row, RowNum=1)
afinn_row = c("burn", 0)
afinnPlus = InsertRow(afinnPlus, NewRow = afinn_row, RowNum=1)
afinn_row = c("burning", 3)
afinnPlus = InsertRow(afinnPlus, NewRow = afinn_row, RowNum=1)
afinn_row = c("smart", 2)
afinnPlus = InsertRow(afinnPlus, NewRow = afinn_row, RowNum=1)
afinn_row = c("bosom", 2)
afinnPlus = InsertRow(afinnPlus, NewRow = afinn_row, RowNum=1)
afinn_row = c("honey", 2)
afinnPlus = InsertRow(afinnPlus, NewRow = afinn_row, RowNum=1)
afinn_row = c("touch", 2)
afinnPlus = InsertRow(afinnPlus, NewRow = afinn_row, RowNum=1)
head(afinnPlus)
## # A tibble: 6 x 2
## word value
## <chr> <chr>
## 1 touch 2
## 2 honey 2
## 3 bosom 2
## 4 smart 2
## 5 burning 3
## 6 burn 0