
# install.packages("DataCombine")  # run once if DataCombine is not installed

library('tidyverse')
## ── Attaching packages ───────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.0     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library('tidytext')
library('tidyr')
library('scales')
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
library('wordcloud')
## Loading required package: RColorBrewer
library('reshape2')
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library('textdata')
library('DataCombine')
data("stop_words")

TITLE: Music by the Numbers!

A data science analysis by Francesco Bonifazi.

This project proposes, and seeks to show, that song lyrics can be a determining factor of genre. We do not discount the power of harmony, melody, and rhythm to differentiate genre, but it is not common for musicians, educators, and the like to classify music by its lyrics.

# GOAL:

The goal of this project is to use data science to analyze a library of popular song lyrics, and to build and train a model that identifies the words that best characterize a song’s genre. Ultimately, I plan to test the model on song lyrics that are not in the sample database, including ones from major artists and beyond.

#### Action Plan:

I will use the readily available subset of lyrics from The Million Song library. My sample size will be 57,000 songs across multiple genres and artists. I will split this into training and test data, starting with 0.75 for training. To train the prediction, I will need the most likely genre for each song. Unfortunately, this is not provided with the freely available data. Because selecting a genre for each of the 57,000 songs would be a huge task, I will assign a genre to each artist instead, using multiple sources that categorize music. I will also list a “sub-genre” in case this is more effective to use.

# Business Case:

Why would anyone or any company care whether lyrics define a song’s genre?

Let’s look at potential company use. Businesses such as Pandora and Spotify have been analyzing and classifying music (including songs with lyrics) to provide their customers with a “premium” listening experience. The common thought seems to be that most listeners don’t enjoy a wide range of music, so keeping them on the service relies in part on providing recommendations, and even “next-up” songs. While these algorithms are valuable corporate secrets, to my knowledge none have focused on lyrics exclusively. It would benefit these music service providers to complement their own AI/ML solutions with another one focused on lyrics. This would offer both confirmation of their music analysis and new insights.

Next, let’s look at songwriters, their management, and licensing companies such as ASCAP and BMI. Among songwriters, there is a sub-group who are only lyricists, meaning someone else composes the music. Composers look for lyrics to compose to. Lyricists look for composers who are known for specific genres. Many haven’t worked together before. Managers often have knowledge of both sides, and will broker deals that combine their talents. Giving lyricists and composers the ability to look for each other by matching lyrics with music could be a further democratization of the music industry.

Licensing companies deal with both lyrics and songs as legal assets they manage for artists. While this is not the main part of their business, they would benefit from validating a lyric’s genre against what the artist thinks it is. When another artist wants to record someone else’s song, they have to pay license fees, to ASCAP for example. ASCAP has a website that deals with this from a financial aspect, but not from a search-and-select-song aspect. Having this capability would enhance their service offering to their clients, be a competitive differentiator, and potentially increase profits through “premium” services.

High-Level Summary of Results

To be written later.

Read the 57,000-song dataset from a .csv file online.
Read my artist-to-genre .csv file online.

Join them together into one data frame named “songdata”.

raw_song <- read_csv("https://foco-ds-portal-files.s3.amazonaws.com/songdata.csv")
## Parsed with column specification:
## cols(
##   artist = col_character(),
##   song = col_character(),
##   link = col_character(),
##   text = col_character()
## )
raw_genre = read_csv("https://foco-ds-portal-files.s3.amazonaws.com/Artists_Genre_Mapping.csv")
## Parsed with column specification:
## cols(
##   Band = col_character(),
##   `Genre Updated` = col_character()
## )
songdata = raw_song %>% 
  left_join(raw_genre, by = c('artist' = 'Band'))
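
A left join keeps every song, but it can also return more rows than the song table when an artist appears more than once in the mapping. The sketch below shows both effects on made-up data. The tibbles `songs` and `genres` and their rows are illustrative stand-ins, not the real files.

```r
library(dplyr)
library(tibble)

# Toy stand-ins for raw_song and raw_genre (made-up rows)
songs <- tibble(
  artist = c("ABBA", "Adele", "Unknown Band"),
  song   = c("Bang", "Hello", "Mystery")
)
genres <- tibble(
  Band  = c("ABBA", "Adele", "Adele"),  # "Adele" is listed twice on purpose
  Genre = c("Pop", "Pop", "Soul")
)

joined <- songs %>%
  left_join(genres, by = c("artist" = "Band"))

nrow(songs)   # rows before the join
nrow(joined)  # rows after the join; larger when a key repeats in the mapping

# Artists that appear more than once in the mapping
dup_keys <- genres %>% count(Band) %>% filter(n > 1)

# Songs whose artist has no genre in the mapping
unmatched <- joined %>% filter(is.na(Genre))
```

Checking `dup_keys` and `unmatched` right after the join makes it easier to explain any difference between the row counts before and after.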

The raw song dataset has 4 columns; the join adds a fifth, “Genre Updated”.

class(songdata)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

It is a data frame, and already a tibble as well (the “spec_tbl_df” class comes from read_csv). Try converting it with as_tibble anyway.

songdata_tibble <- as_tibble(songdata)
class(songdata_tibble)
## [1] "tbl_df"     "tbl"        "data.frame"

Almost the same: only the “spec_tbl_df” class is gone. Count the rows and list the columns.

songdata_tibble %>%
count()
## # A tibble: 1 x 1
##       n
##   <int>
## 1 57668
colnames(songdata_tibble)
## [1] "artist"        "song"          "link"          "text"         
## [5] "Genre Updated"

Rename “Genre Updated” back to “Genre” as before, drop songs with no genre, keep only the columns I need, and add a row ID.

 songdata_tibble_sm = rename(songdata_tibble, "Genre" = "Genre Updated") %>%
  filter(!is.na(Genre)) %>%
  select(artist, song, text, Genre) %>%
  tibble::rowid_to_column("ID")
colnames(songdata_tibble_sm)
## [1] "ID"     "artist" "song"   "text"   "Genre"

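Each row is a song. The raw dataset has 57,650 rows, while the join above returned 57,668, so a few artists probably match more than one row in the genre mapping. Songs with no mapped genre have also been dropped from this table.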

songdata_tibble_sm %>%
  select(artist) %>%
  unique()
## # A tibble: 359 x 1
##    artist         
##    <chr>          
##  1 ABBA           
##  2 Ace Of Base    
##  3 Adam Sandler   
##  4 Adele          
##  5 Aerosmith      
##  6 Air Supply     
##  7 Aiza Seguerra  
##  8 Alabama        
##  9 Alice Cooper   
## 10 Alice In Chains
## # … with 349 more rows

There are 359 artists left in this dataset after the genre filter (643 before it). If I need 30% for test data, that is about 108 artists.
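
The artist count and the size of an artist-level hold-out can be computed rather than worked out by hand. This is a minimal sketch on a made-up table; `toy`, `test_share`, and `test_artists` are illustrative names, and holding out whole artists is an assumption about how the test set might be built.

```r
library(dplyr)
library(tibble)

set.seed(123)

# Made-up song table with repeated artists
toy <- tibble(
  artist = c("ABBA", "ABBA", "Adele", "Kiss", "Cher", "Cher"),
  song   = c("Bang", "Waterloo", "Hello", "Beth", "Believe", "Dark Lady")
)

test_share     <- 0.30
n_artists      <- n_distinct(toy$artist)          # number of unique artists
n_test_artists <- round(n_artists * test_share)   # artists to hold out

# Pick the held-out artists at random, then split the songs by artist
test_artists <- sample(unique(toy$artist), n_test_artists)
test_songs   <- toy %>% filter(artist %in% test_artists)
train_songs  <- toy %>% filter(!artist %in% test_artists)
```

Splitting by artist keeps all of an artist's songs on one side, which avoids testing on an artist the model has already seen.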

songdata_tibble_sm %>%
  select(Genre) %>%
  unique()
## # A tibble: 29 x 1
##    Genre     
##    <chr>     
##  1 Pop       
##  2 Comedy    
##  3 Rock      
##  4 Americana 
##  5 Religious 
##  6 Musicals  
##  7 Folk      
##  8 Soundtrack
##  9 Soul      
## 10 Jazz      
## # … with 19 more rows

#FB: There are 29 Genres. The NA genre was already removed by the filter(!is.na(Genre)) step above.

head (songdata_tibble_sm)
## # A tibble: 6 x 5
##      ID artist song           text                                    Genre
##   <int> <chr>  <chr>          <chr>                                   <chr>
## 1     1 ABBA   Ahe's My Kind… "Look at her face, it's a wonderful fa… Pop  
## 2     2 ABBA   Andante, Anda… "Take it easy with me, please  \nTouch… Pop  
## 3     3 ABBA   As Good As New "I'll never know why I had to go  \nWh… Pop  
## 4     4 ABBA   Bang           "Making somebody happy is a question o… Pop  
## 5     5 ABBA   Bang-A-Boomer… "Making somebody happy is a question o… Pop  
## 6     6 ABBA   Burning My Br… "Well, you hoot and you holler and you… Pop

How many ABBA songs?

songdata_tibble_sm %>%
  filter (artist == "ABBA") %>%
count()
## # A tibble: 1 x 1
##       n
##   <int>
## 1   113

113 ABBA Songs!

Break up the song lyrics (text) into words for ABBA

songdata_tibble_sm %>%
  filter (artist == "ABBA") %>%
unnest_tokens(word, text)
## # A tibble: 28,724 x 5
##       ID artist song                  Genre word     
##    <int> <chr>  <chr>                 <chr> <chr>    
##  1     1 ABBA   Ahe's My Kind Of Girl Pop   look     
##  2     1 ABBA   Ahe's My Kind Of Girl Pop   at       
##  3     1 ABBA   Ahe's My Kind Of Girl Pop   her      
##  4     1 ABBA   Ahe's My Kind Of Girl Pop   face     
##  5     1 ABBA   Ahe's My Kind Of Girl Pop   it's     
##  6     1 ABBA   Ahe's My Kind Of Girl Pop   a        
##  7     1 ABBA   Ahe's My Kind Of Girl Pop   wonderful
##  8     1 ABBA   Ahe's My Kind Of Girl Pop   face     
##  9     1 ABBA   Ahe's My Kind Of Girl Pop   and      
## 10     1 ABBA   Ahe's My Kind Of Girl Pop   it       
## # … with 28,714 more rows

28,724 words in all the ABBA songs! But there are repeats and throw-away words (“a”, “i”, etc.) in this list.

Search for the number of times “girl” is used. Use tidytext’s unnest_tokens function to arrange one word per row.

songdata_tibble_sm %>%
  filter (artist == "ABBA") %>%
  group_by(song) %>%
unnest_tokens(word, text) %>%
filter(word == "girl") %>%
  count()
## # A tibble: 2 x 2
## # Groups:   song [2]
##   song                  n
##   <chr>             <int>
## 1 I Have A Dream        1
## 2 Man In The Middle     1

Only 2 ABBA songs, out of 113, have the word “girl” in them? That seems low. It missed their first song, “Ahe’s My Kind Of Girl”, which I checked has “girl” 4 times. I have a lot to learn about text parsing…

Let’s get rid of words such as “the”, “a”, etc., called “stop_words”. It seemed to choke on all rows, so I am doing it for ABBA only to see the results.
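
One way to cross-check a per-song word count is to tokenize first, filter for the word, and only then count by song, with no grouping before the tokenizer. The sketch below does this on made-up lyrics; `toy_songs` and `n_girl` are illustrative names, and this is not a diagnosis of the result above.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Made-up lyrics, with mixed case and punctuation
toy_songs <- tibble(
  song = c("Song A", "Song B", "Song C"),
  text = c("Girl, oh girl, hello", "No match here", "That GIRL again")
)

girl_counts <- toy_songs %>%
  unnest_tokens(word, text) %>%    # one lowercased word per row
  filter(word == "girl") %>%
  count(song, name = "n_girl")     # occurrences of "girl" per song

girl_counts
```

Because `unnest_tokens` lowercases and strips punctuation by default, "Girl," and "GIRL" both count as "girl" here.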

songdata_tibble_sm %>%
  filter (artist == "ABBA") %>%  
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 1,917 x 2
##    word      n
##    <chr> <int>
##  1 la      376
##  2 love    187
##  3 gonna    86
##  4 time     81
##  5 feel     80
##  6 life     79
##  7 baby     69
##  8 day      68
##  9 girl     68
## 10 ah       67
## # … with 1,907 more rows

1,917 unique words in all ABBA songs once stop words are removed. All words are used at least twice in ABBA’s catalog here. No single word is unique to one song. This shows “girl” used 68 times!

songdata_tibble_sm %>%
  filter (artist == "ABBA") %>%  
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  
  mutate(word = reorder(word,n))  %>%  
  filter(n > 50) %>%
   
  ggplot(aes(word, n)) +
    geom_col() +
  xlab("Words in song") +
coord_flip()
## Joining, by = "word"

The distribution is interesting: “la” and “love” are the most used words, and many words are used more than 50 times.

This shows “girl” is used by ABBA over 50 times!

I want to cut this song list down for training data, and will increase the size if I don’t find interesting trends. I want to use a random selection since this is an alphabetical list.

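set.seed(123)
training_split = 0.75   # 75% of the sampled data for train
sampled_fraction = 0.10 # Downsize while building the analysis
# Training set: 0.75 * 0.10 = 7.5% of all songs, drawn at random
train_dat = songdata_tibble_sm %>% sample_frac(training_split * sampled_fraction)
# Test set: every song not in train_dat; join on the unique ID, since song titles can repeat
test_dat = songdata_tibble_sm %>% anti_join(train_dat, by = 'ID')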
head(train_dat)
## # A tibble: 6 x 5
##      ID artist      song              text                          Genre  
##   <int> <chr>       <chr>             <chr>                         <chr>  
## 1  2986 Cliff Rich… And I Love Her    "I give her all my love  \nT… Pop    
## 2 29925 Keith Urban Sweet Thing       "When I picked you up for ou… Mod-Co…
## 3 29710 Justin Tim… Like I Love You   "Just something about you  \… Pop    
## 4 37529 Uriah Heep  Fools             "I know this feeling inside … Rock   
## 5  2757 Christina … Over The River A… "Over the river  \nAnd throu… Pop    
## 6  9642 Leonard Co… Show Me The Place "Show me the place, where yo… Folk
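
If the intent is a 75/25 split of the downsized sample, rather than a small training set with everything else as test, one option is to subsample first and split second. This is a sketch under that assumption, run on a made-up table; `toy`, `working`, `train_toy`, and `test_toy` are illustrative names.

```r
library(dplyr)
library(tibble)

set.seed(123)

# Made-up table: 200 rows with unique IDs; song titles repeat on purpose
toy <- tibble(ID = 1:200, song = paste("Song", rep(1:100, 2)))

sampled_fraction <- 0.10
training_split   <- 0.75

working   <- toy %>% sample_frac(sampled_fraction)         # 10% working sample
train_toy <- working %>% sample_frac(training_split)       # 75% of the sample
test_toy  <- working %>% anti_join(train_toy, by = "ID")   # remaining 25%

nrow(train_toy)
nrow(test_toy)
```

Joining on `ID` rather than on the song title matters here: with repeated titles, an anti-join on `song` would also drop test rows that merely share a title with a training row.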

Let’s take a look at the lyrics in the training data… Starting with Dolly Parton:

train_dat %>%
  group_by(artist) %>%
  count(artist, sort = TRUE)
## # A tibble: 344 x 2
## # Groups:   artist [344]
##    artist              n
##    <chr>           <int>
##  1 Air Supply         21
##  2 Kiss               21
##  3 Michael Jackson    21
##  4 Cher               20
##  5 Hank Snow          19
##  6 Moody Blues        19
##  7 Beautiful South    18
##  8 Dolly Parton       18
##  9 Green Day          18
## 10 John Denver        18
## # … with 334 more rows
train_dat_unnest = train_dat %>%
  filter (artist == "Dolly Parton") %>%  
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  mutate(word = reorder(word,n)) %>%
  filter(n > 1)
## Joining, by = "word"
train_dat_unnest %>%
  filter(n > 5) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab("Words in Dolly Parton's songs") +
  coord_flip()

Dolly’s words are somewhat different. “girl” is not among them, which makes sense since she’s a female singer and songs addressed to a girl by a woman are not common in the Country genre!

*** - I need a function that takes an artist’s songs and finds the most used words.
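
A minimal sketch of such a function follows, built from the same steps used for ABBA and Dolly Parton above. The name `top_words` and its arguments are hypothetical, and the data is made up; it assumes a table with `artist` and `text` columns.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical helper: most used non-stop words for one artist
top_words <- function(data, artist_name, min_n = 1) {
  data %>%
    filter(artist == artist_name) %>%
    unnest_tokens(word, text) %>%                      # one word per row
    anti_join(tidytext::stop_words, by = "word") %>%   # drop "the", "of", ...
    count(word, sort = TRUE) %>%                       # most frequent first
    filter(n >= min_n)
}

# Made-up lyrics for two artists
toy <- tibble(
  artist = c("Artist A", "Artist A", "Artist B"),
  text   = c("love love guitar", "the love of guitar", "rain rain rain")
)

top_words(toy, "Artist A")
```

Calling it as `top_words(train_dat, "Dolly Parton", min_n = 2)` would reproduce the per-artist counts without copying the pipeline each time.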

Let’s look at the frequency of words for all the songs in train_dat using the AFINN sentiment lexicon.

afinn = get_sentiments("afinn")
#bing = get_sentiments("bing")
#loughran = get_sentiments("loughran")
head(train_dat_unnest)
## # A tibble: 6 x 2
##   word         n
##   <fct>    <int>
## 1 love        39
## 2 crazy       22
## 3 drive       22
## 4 downtown    21
## 5 cry         18
## 6 family      18

Unnest all of train_dat and remove stop_words

train_dat_unnest = train_dat %>%
  group_by(artist) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
## Joining, by = "word"
  head(train_dat_unnest)
## # A tibble: 6 x 5
## # Groups:   artist [1]
##      ID artist        song           Genre word    
##   <int> <chr>         <chr>          <chr> <chr>   
## 1  2986 Cliff Richard And I Love Her Pop   love    
## 2  2986 Cliff Richard And I Love Her Pop   love    
## 3  2986 Cliff Richard And I Love Her Pop   love    
## 4  2986 Cliff Richard And I Love Her Pop   love    
## 5  2986 Cliff Richard And I Love Her Pop   tenderly
## 6  2986 Cliff Richard And I Love Her Pop   kiss
sent_dat = train_dat_unnest %>%
  left_join(afinn, by = 'word') %>%
  rename(sentiment = value)
 sent_dat
## # A tibble: 208,857 x 6
## # Groups:   artist [344]
##       ID artist        song           Genre word     sentiment
##    <int> <chr>         <chr>          <chr> <chr>        <dbl>
##  1  2986 Cliff Richard And I Love Her Pop   love             3
##  2  2986 Cliff Richard And I Love Her Pop   love             3
##  3  2986 Cliff Richard And I Love Her Pop   love             3
##  4  2986 Cliff Richard And I Love Her Pop   love             3
##  5  2986 Cliff Richard And I Love Her Pop   tenderly        NA
##  6  2986 Cliff Richard And I Love Her Pop   kiss             2
##  7  2986 Cliff Richard And I Love Her Pop   lover           NA
##  8  2986 Cliff Richard And I Love Her Pop   brings          NA
##  9  2986 Cliff Richard And I Love Her Pop   brings          NA
## 10  2986 Cliff Richard And I Love Her Pop   love             3
## # … with 208,847 more rows

Many NAs… this will wreak havoc on my analysis!

I need to add words to the afinn list:
- baby, call, music, row, hold, em, songs, dolly, hear, life, night, parton, time, walk, watched, wrote, abraham, anymore, believes, birmingham, boulder, change, comin, color, day, door, eternity, feel, fits, goin, gonna, guitar, half, hummin, nashville, ol, rca, rock, saving, soul, standing, strummin, till, torch, wait, wash, world = 0
- blue, blues = -2
- heaven’s = 3
- heart, star = 2
- burn = 0 (could burn good or bad for you, baby!)
- burning = 3
- smart = 2
- bosom, honey, touch = 2
- fire = -2????? “Come on light my fire” is good. “I’m going to fire you” is bad…

TEMPORARILY drop rows with no Genre from sent_dat and convert missing sentiment values to 0

sent_dat_noNAs = sent_dat %>%
  group_by(Genre, sentiment) %>%
  filter(!is.na(Genre))

sent_dat_noNAs$sentiment[is.na(sent_dat_noNAs$sentiment)] = 0
  
sent_dat_noNAs
## # A tibble: 208,857 x 6
## # Groups:   Genre, sentiment [242]
##       ID artist        song           Genre word     sentiment
##    <int> <chr>         <chr>          <chr> <chr>        <dbl>
##  1  2986 Cliff Richard And I Love Her Pop   love             3
##  2  2986 Cliff Richard And I Love Her Pop   love             3
##  3  2986 Cliff Richard And I Love Her Pop   love             3
##  4  2986 Cliff Richard And I Love Her Pop   love             3
##  5  2986 Cliff Richard And I Love Her Pop   tenderly         0
##  6  2986 Cliff Richard And I Love Her Pop   kiss             2
##  7  2986 Cliff Richard And I Love Her Pop   lover            0
##  8  2986 Cliff Richard And I Love Her Pop   brings           0
##  9  2986 Cliff Richard And I Love Her Pop   brings           0
## 10  2986 Cliff Richard And I Love Her Pop   love             3
## # … with 208,847 more rows
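
Setting missing scores to 0 makes unscored words indistinguishable from words that are truly neutral. A possible alternative, sketched below on a made-up lexicon, keeps a flag for whether the word was found before filling in the 0. The names `toy_lexicon`, `toy_words`, and `in_lexicon` are illustrative.

```r
library(dplyr)
library(tibble)

# Made-up lexicon and word list
toy_lexicon <- tibble(word = c("love", "kiss"), value = c(3, 2))
toy_words   <- tibble(word = c("love", "tenderly", "kiss", "brings"))

scored <- toy_words %>%
  left_join(toy_lexicon, by = "word") %>%
  rename(sentiment = value) %>%
  mutate(
    in_lexicon = !is.na(sentiment),       # TRUE when the lexicon had a score
    sentiment  = coalesce(sentiment, 0)   # unscored words become 0
  )

# Mean over scored words only, so unscored words do not pull it toward 0
mean_scored <- scored %>%
  filter(in_lexicon) %>%
  summarize(mean_sentiment = mean(sentiment))
```

With the flag in place, later filters can use `in_lexicon` instead of `sentiment != 0`, which would also keep any words deliberately scored as 0.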

Plot the distribution of non-zero sentiment values for each Genre (the NAs were changed to zeros above).

sent_dat_noNAs %>%
  filter(sentiment != 0) %>%
   group_by(Genre) %>%
 
  select(sentiment) %>%
  ggplot(aes(x = sentiment)) + 
  geom_histogram(bins = 10) + 
  facet_wrap(~Genre, scales = 'free')
## Adding missing grouping variables: `Genre`

Note the “holes” at 0: words with a sentiment of 0 were filtered out before plotting.

sent_dat_noNAs %>%
  filter(sentiment != 0) %>%
   group_by(Genre) %>%
  summarize(mean_sentiment = mean(sentiment)) %>%
  ggplot(aes(x = Genre, y = mean_sentiment)) + 
  geom_col() + 
  coord_flip()

Most Genres have a positive mean sentiment, but:
- Rap, Hip-Hop, Electronica, DJBeats, Dance, Comedy, and Americana are negative.

Rock and Country are slightly negative (essentially neutral).

I need to add words to the afinn table to enrich its vocabulary for these songs.

afinn_row = c("frankieb", 0)
head(afinn_row)
## [1] "frankieb" "0"

Make a dataframe with these words = 0

baby, call, music, row, hold, em, songs, dolly, hear, life, night, parton, time, walk, watched, wrote, abraham, anymore, believes, birmingham, boulder, change, comin, color, day, door, eternity, feel, fits, goin, gonna, guitar, half, hummin, nashville, ol, rca, rock, saving, soul, standing, strummin, till, torch, wait, wash, world

Make a vector with all the words to insert into afinn later

afinn_add_data = c("baby", "call", "music", "row", "hold", "em", "songs", "dolly", "hear", "life", "night", "parton", "time", "walk", "watched", "wrote", "abraham", "anymore", "believes", "birmingham", "boulder", "change", "comin", "color", "day", "door", "eternity", "feel", "fits", "goin", "gonna", "guitar", "half", "hummin", "nashville", "ol", "rca", "rock", "saving", "soul", "standing", "strummin", "till", "torch", "wait", "wash", "world")
                   
afinn_add_data
##  [1] "baby"       "call"       "music"      "row"        "hold"      
##  [6] "em"         "songs"      "dolly"      "hear"       "life"      
## [11] "night"      "parton"     "time"       "walk"       "watched"   
## [16] "wrote"      "abraham"    "anymore"    "believes"   "birmingham"
## [21] "boulder"    "change"     "comin"      "color"      "day"       
## [26] "door"       "eternity"   "feel"       "fits"       "goin"      
## [31] "gonna"      "guitar"     "half"       "hummin"     "nashville" 
## [36] "ol"         "rca"        "rock"       "saving"     "soul"      
## [41] "standing"   "strummin"   "till"       "torch"      "wait"      
## [46] "wash"       "world"

Add words and values into my var afinnPlus for neutral sentiment words

afinnPlus = afinn

# c(word, 0) is a character vector, so the value column becomes character
for (i in seq_along(afinn_add_data)) {
  afinn_row = c(afinn_add_data[i], 0)
  afinnPlus = InsertRow(afinnPlus, NewRow = afinn_row, RowNum = 1)
}
afinnPlus
## # A tibble: 2,524 x 2
##    word     value
##  * <chr>    <chr>
##  1 world    0    
##  2 wash     0    
##  3 wait     0    
##  4 torch    0    
##  5 till     0    
##  6 strummin 0    
##  7 standing 0    
##  8 soul     0    
##  9 saving   0    
## 10 rock     0    
## # … with 2,514 more rows

Now add new words to afinn with non-zero values:
- blue, blues = -2
- heaven’s = 3
- heart, star = 2
- burn = 0 (could burn good or bad for you, baby!)
- burning = 3
- smart = 2
- bosom, honey, touch = 2

afinn_row = c("blue", -2)
afinnPlus = InsertRow(afinnPlus, NewRow = afinn_row, RowNum=1)

afinn_row = c("blues", -2)
afinnPlus = InsertRow(afinnPlus, NewRow = afinn_row, RowNum=1)
head(afinnPlus)
## # A tibble: 6 x 2
##   word  value
##   <chr> <chr>
## 1 blues -2   
## 2 blue  -2   
## 3 world 0    
## 4 wash  0    
## 5 wait  0    
## 6 torch 0
afinn_row = c("heaven's", 3)
afinnPlus = InsertRow(afinnPlus, NewRow = afinn_row, RowNum=1)

afinn_row = c("heart", 2)
afinnPlus = InsertRow(afinnPlus, NewRow = afinn_row, RowNum=1)

afinn_row = c("star", 2)
afinnPlus = InsertRow(afinnPlus, NewRow = afinn_row, RowNum=1)

afinn_row = c("burn", 0)
afinnPlus = InsertRow(afinnPlus, NewRow = afinn_row, RowNum=1)

afinn_row = c("burning", 3)
afinnPlus = InsertRow(afinnPlus, NewRow = afinn_row, RowNum=1)

afinn_row = c("smart", 2)
afinnPlus = InsertRow(afinnPlus, NewRow = afinn_row, RowNum=1)

afinn_row = c("bosom", 2)
afinnPlus = InsertRow(afinnPlus, NewRow = afinn_row, RowNum=1)

afinn_row = c("honey", 2)
afinnPlus = InsertRow(afinnPlus, NewRow = afinn_row, RowNum=1)

afinn_row = c("touch", 2)
afinnPlus = InsertRow(afinnPlus, NewRow = afinn_row, RowNum=1)

head(afinnPlus)
## # A tibble: 6 x 2
##   word    value
##   <chr>   <chr>
## 1 touch   2    
## 2 honey   2    
## 3 bosom   2    
## 4 smart   2    
## 5 burning 3    
## 6 burn    0
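
The inserts above leave the value column as character, which would break numeric summaries such as a mean. A possible alternative, sketched here on a made-up lexicon, builds the new words as a tibble with numeric values and binds them in one step. The names `toy_afinn`, `new_words`, and `lexicon_plus` are illustrative, and only a few of the words above are included.

```r
library(dplyr)
library(tibble)

# Made-up stand-in for the afinn lexicon (word, numeric value)
toy_afinn <- tibble(word = c("love", "kiss"), value = c(3, 2))

# New words with their scores, kept numeric
new_words <- tibble(
  word  = c("baby", "guitar", "blue", "blues", "heart", "burning"),
  value = c(0, 0, -2, -2, 2, 3)
)

lexicon_plus <- bind_rows(new_words, toy_afinn) %>%
  distinct(word, .keep_all = TRUE)   # keep the first entry if a word repeats

lexicon_plus
```

Putting `new_words` first means a custom score wins if the word is already in the lexicon, and `distinct` prevents a word from matching twice in a later join.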