Objective

This project should allow you to apply the information you’ve learned in the course to a new dataset. While the structure of the final project will be more of a research project, you can use this knowledge to appropriately answer questions in all fields, along with the practical skills of writing a report that others can read.

Instructions

The final document should be a knitted HTML/PDF/Word document from a Markdown file. You will turn in the knitted document along with your .Rmd. Be sure to spell and grammar check your work! The following sections should be included:

Introduction

Introduce your research topic. What is the background knowledge that someone would need to understand the field or area that you have decided to investigate? In this section, you should include sources that help explain the background area and cite them in APA style. 5-10 articles across the paper would be appropriate.

We all like listening to music, and we all have different tastes: some people prefer rock, some prefer hip-hop, and some prefer country. But have you ever given a thought to the intrinsic elements of a music genre? Have you ever wondered how the same word could be used differently in two different genres? Or how the sentiments expressed in music have changed over time, for example from what music was in the 80s and 90s to what it is now? Music has long been a way to convey messages to people, a symbol of peace, and a tool to relax and clear the mind, and those roles have shifted over time.

This is exactly what our study explores. We will talk about how words are represented differently in two different genres, demonstrate how the use of certain sentiments has changed over time, and show how a particular word can be closely associated with different words in different genres of music.

To learn more about the different genres of music, see reference 1.

Hypothesis / Problem Statement

What is the data that you are using for your project? What is your hypothesis as to the outcome of the analysis? Why is the problem important for us to study or answer?

This data set contains 339,277 observations and five useful variables: Song Name, Year, Artist, Genre, and Lyrics. Year, the only numeric variable, ranges from 1982 to 2016 with some outliers that will be addressed later during data cleaning. There are a total of 11 categories for the Genre variable: Country, Electronic, Folk, Hip-Hop, Indie, Jazz, Metal, Pop, R&B, Rock, and Other.

Based on our understanding of music, our assumption is that rock and pop have a high degree of overlap in the words used in their lyrics, as pop is generally considered a softer alternative to rock (ref 2). Our theory is that rock and pop lyrics are mostly about love, where one person conveys a feeling to another. Secondly, we believe that words with deeper meanings, such as love and god, are used as slang by hip-hop artists, whereas these words carry a much deeper meaning in rock music. We would like to leverage text mining and sentiment analysis to examine these theories. We also believe that words representing negative sentiment are used less now than they were in the 80s.

This problem may not be as pressing as some other popular research questions, but it definitely helps us understand how words from a language can be used in different contexts in different places. This study also illustrates the value of a method like a semantic vector space compared to association measures computed on a plain frequency matrix.

Statistical Analysis Plan

Explain the statistical analysis that you are using - you can assume some statistical background, but not to the specific design you are mentioning. For example, the person would know what a mean is, but not Lexeme Analysis.

Analysis will be performed in the following manner:

  1. Text preprocessing (see the toy sketch after this list):
  • Checking for missing values, especially missing lyrics, and dealing with them either by removing them or by importing the lyrics using the song name and artist via the genius package
  • Subsetting the data frame to include data from a particular genre
  • Turning the song lyrics into a vector source and then into a corpus that contains each lyric
  • Cleaning the corpus by removing punctuation and stopwords (such as and, the), converting each word to lower case (we do not want Look and look to be treated differently), and stripping extra whitespace so that only the words remain
  • Stemming the corpus contents to extract the root of each word (e.g. we do not want liking and liked to be counted as different words)
  • Creating a Term-Document Matrix, i.e. terms as rows and document numbers (one per lyric) as columns, to see the number of occurrences of each word in each lyric, and converting that into a regular matrix (sparse terms are removed first, because R cannot hold such a large dense matrix in memory)
  • Computing overall term frequency by summing across each row
  2. Exploratory Analysis:
  • Creating a wordcloud of term frequencies for the rock genre
  • Creating a wordcloud of term frequencies for the pop genre
  • Creating a wordcloud of term frequencies for the hip-hop genre
  • Creating a commonality cloud to see the most frequently shared words between the rock and pop genres
  • Finding associations with certain words (e.g. the words closest to learning on Google would probably be machine and deep)
  • Clustering closely associated words with one another
  3. Sentiment Analysis:
  • Turning each word into a unique element of a word column and extracting its sentiment by joining with the built-in NRC lexicon
  • Exploring the most negative and positive words
  • Positive sentiments over time
  • Negative sentiments over time
  • Using linear regression to identify a relationship between sentiment and time
  4. Semantic vector spaces:
  • Turning our previously built matrix into a tf-idf matrix. This lets us down-weight terms that occur in every document: in a romantic novel, for example, the word love will occur everywhere, but it does not give us much information about that particular novel, so tf-idf automatically weighs such words down
  • Conducting latent semantic analysis (via singular value decomposition) to assess how strongly a word is associated with a particular document, in this case a lyric
  • Plotting the neighbors of god, love, and fight in the semantic vector spaces of both the rock and hip-hop lyrics to see how these words are represented in different contexts across the two genres, and computing cosine similarities between these words
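
As a quick illustration of the preprocessing steps in point 1, here is a minimal toy sketch with made-up lyric snippets (not the project data); the full helper functions used for the real analysis appear in the Statistical Analysis Results section.

library(tm)
# Two made-up lyric snippets standing in for the real data
toy_lyrics <- c("Love me like you do, love me", "I love the way you lie")
toy_corpus <- VCorpus(VectorSource(toy_lyrics))
toy_corpus <- tm_map(toy_corpus, content_transformer(tolower))
toy_corpus <- tm_map(toy_corpus, removePunctuation)
toy_corpus <- tm_map(toy_corpus, removeWords, stopwords('en'))
toy_corpus <- tm_map(toy_corpus, stripWhitespace)
toy_tdm <- TermDocumentMatrix(toy_corpus)    # terms as rows, lyrics as columns
as.matrix(toy_tdm)
rowSums(as.matrix(toy_tdm))                  # overall term frequency per word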

Concepts used from this class: 1) Regression Analysis 2) Hierarchical Clustering 3) Sentiment Analytics 4) Semantic Vector Spaces

Method - Data - Variables

Explain the data you have selected to study. You can find data through many available corpora or other datasets online (ask for help here for sure!). How was the data collected? Who/what is in the data? Identify what the independent and dependent variables are for the analysis. How do these independent and dependent variables fit into the analyses you selected?

This data is a Kaggle dataset (ref 3).

This dataset was collected by Mr. Gyanendra Mishra, who used a web crawler to scrape the data from the web.

Description of each variable:

song - the name of the song
year - the year in which it was composed
artist - who sang the song
genre - what genre of music it belongs to, e.g. pop, country, or hip-hop
lyrics - the actual words of the song

In terms of the regression, the dependent variable will be created later, in the sentiment analysis phase (more details appear in the sentiment analysis part of the Statistical Analysis Results section). Briefly, the dependent variable will be the proportion of words carrying a given sentiment out of the total words in a song, and the independent variable will be year, so we can see whether the use of words representing a certain sentiment has changed over time.
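
As a quick sketch of the model we have in mind (the data frame and numbers below are made-up placeholders; the real columns are constructed in the sentiment analysis part of the results):

# percent = (count of words carrying a given sentiment) / (total words in the song)
toy_sentiment <- data.frame(
  year    = c(1985, 1995, 2005, 2015),
  percent = c(0.12, 0.10, 0.09, 0.08)   # hypothetical sentiment proportions
)
summary(lm(percent ~ year, data = toy_sentiment))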

For most of the analysis we will be using the lyrics to create our corpus for text mining.

Song name and artist will be used to retrieve lyrics when they are missing, or to extract the URL of a song, as you will see later.

Note: this does not mean that the uses of song and artist are limited to what was mentioned above. There are many other things that can be done with these variables, but for the purpose of our analysis and testing our theories we will not need them.

Statistical Analysis Results

Analyze the data given your statistical plan. Report the appropriate statistics for that analysis (see lecture notes). Include figures! Include the R-chunks so we can see the analyses you ran and output from the study. Note what you are doing in each step.

Loading the required packages

library(genius)
library(qdap)
## Loading required package: qdapDictionaries
## Loading required package: qdapRegex
## Loading required package: qdapTools
## Loading required package: RColorBrewer
## 
## Attaching package: 'qdap'
## The following object is masked from 'package:base':
## 
##     Filter
library(tm)
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:qdap':
## 
##     ngrams
## 
## Attaching package: 'tm'
## The following objects are masked from 'package:qdap':
## 
##     as.DocumentTermMatrix, as.TermDocumentMatrix
library(tidyverse)
## -- Attaching packages -------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0       v purrr   0.3.2  
## v tibble  2.1.1       v dplyr   0.8.0.1
## v tidyr   0.8.3       v stringr 1.4.0  
## v readr   1.3.1       v forcats 0.4.0
## -- Conflicts ----------------------------------------------- tidyverse_conflicts() --
## x ggplot2::%+%()      masks qdapRegex::%+%()
## x ggplot2::annotate() masks NLP::annotate()
## x dplyr::explain()    masks qdapRegex::explain()
## x dplyr::filter()     masks stats::filter()
## x dplyr::id()         masks qdapTools::id()
## x dplyr::lag()        masks stats::lag()
library(wordcloud)
library(ggthemes)
library(dendextend)
## 
## ---------------------
## Welcome to dendextend version 1.10.0
## Type citation('dendextend') for how to cite the package.
## 
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
## 
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## Or contact: <tal.galili@gmail.com>
## 
##  To suppress this message use:  suppressPackageStartupMessages(library(dendextend))
## ---------------------
## 
## Attaching package: 'dendextend'
## The following object is masked from 'package:qdap':
## 
##     %>%
## The following object is masked from 'package:stats':
## 
##     cutree
library(RWeka)
library(tidytext)
library(lsa)
## Loading required package: SnowballC
library(LSAfun)
## Loading required package: rgl
## 
## Attaching package: 'rgl'
## The following object is masked from 'package:qdap':
## 
##     %>%
## 
## Attaching package: 'LSAfun'
## The following object is masked from 'package:purrr':
## 
##     compose
library(cluster)

Defining the required functions

clean_corpus <- function(corpus){
  # Remove punctuation
  corpus <- tm_map(corpus, removePunctuation)
  # Transform to lower cases
  corpus <- tm_map(corpus, content_transformer(tolower))
  # Remove stopwords
  corpus <- tm_map(corpus, removeWords, stopwords('en'))
  # Strip Whitespace
  corpus <- tm_map(corpus, stripWhitespace)
  return(corpus)
}

# Build a (term-)document matrix or a term-frequency vector for one genre.
# df: data frame with genre and lyrics columns; genre: genre to subset;
# stem: whether to stem words; tdm: TRUE for a term-document matrix,
# FALSE for a document-term matrix; sparse: threshold for removeSparseTerms;
# frequency: if TRUE, return sorted overall term frequencies instead of the matrix.
create_genre <- function(df, genre, stem = F, tdm = T, sparse = 0.95, frequency = F){
  # Subset to the requested genre and keep only the lyrics
  df <- df[df$genre == genre, ]
  df <- select(df, lyrics)
  df <- df$lyrics
  # Turn the lyrics into a corpus and clean it
  df_source <- VectorSource(df)
  df_corpus <- VCorpus(df_source)
  clean_corp <- clean_corpus(df_corpus)
  # Optionally stem each document
  if(stem){
    clean_corp <- tm_map(clean_corp, stemDocument)
  } else{
    clean_corp
  }
  # Term-document or document-term matrix
  if(tdm){
    clean_m <- TermDocumentMatrix(clean_corp)
  } else{
    clean_m <- DocumentTermMatrix(clean_corp)
  }
  # Drop sparse terms and convert to a regular (dense) matrix
  clean_non_sparse <- removeSparseTerms(clean_m, sparse = sparse)
  clean_m <- as.matrix(clean_non_sparse)
  # Optionally collapse to overall term frequencies
  if(frequency){
    clean_m <- sort(rowSums(clean_m), decreasing = T)
  } else{
    clean_m
  }
  return(clean_m)
}

# Same as create_genre(), but returns the sparse TermDocumentMatrix /
# DocumentTermMatrix object itself (needed for functions such as findAssocs())
# rather than a dense matrix
create_tdm_dtm <- function(df, genre, stem = F, tdm = T, sparse = 0.95){
  df <- df[df$genre == genre, ]
  df <- select(df, lyrics)
  df <- df$lyrics
  df_source <- VectorSource(df)
  df_corpus <- VCorpus(df_source)
  clean_corp <- clean_corpus(df_corpus)
  if(stem){
    clean_corp <- tm_map(clean_corp, stemDocument)
  } else{
    clean_corp
  }
  if(tdm){
    clean_m <- TermDocumentMatrix(clean_corp)
  } else{
    clean_m <- DocumentTermMatrix(clean_corp)
  }
  clean_non_sparse <- removeSparseTerms(clean_m, sparse = sparse)
  return(clean_non_sparse)
}


# Percentage of missing (NA) values in a vector
percentmiss = function(x){ sum(is.na(x))/length(x) *100 }

Importing the data

setwd('C:/Users/bneer/OneDrive/Desktop/Analyzing Human Language/Final project')
lyrics <- read.csv('lyrics.csv')
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
## dec, : EOF within quoted string
lyrics <- lyrics[,-1]
head(lyrics[,-5])
##                     song year          artist genre
## 1              ego-remix 2009 beyonce-knowles   Pop
## 2           then-tell-me 2009 beyonce-knowles   Pop
## 3                honesty 2009 beyonce-knowles   Pop
## 4        you-are-my-rock 2009 beyonce-knowles   Pop
## 5          black-culture 2009 beyonce-knowles   Pop
## 6 all-i-could-do-was-cry 2009 beyonce-knowles   Pop

Cleaning data

str(lyrics)
## 'data.frame':    339277 obs. of  5 variables:
##  $ song  : Factor w/ 236867 levels "0-0","0-0-0",..: 56272 205225 85347 233083 23435 8810 146916 219754 178029 228598 ...
##  $ year  : int  2009 2009 2009 2009 2009 2009 2009 2009 2009 2009 ...
##  $ artist: Factor w/ 17088 levels "009-sound-system",..: 4499 4499 4499 4499 4499 4499 4499 4499 4499 4499 ...
##  $ genre : Factor w/ 12 levels "Country","Electronic",..: 10 10 10 10 10 10 10 10 10 10 ...
##  $ lyrics: Factor w/ 229599 levels "","\003Its what youre afraid of.\nAll of my fears,\nAll of my Faults.\nAll that came first,\nAll will be lost.",..: 141437 150816 108359 142333 149557 92450 187210 196954 16818 136101 ...

We only have 229,599 unique lyric values in a data set with 339,277 rows, so clearly we have duplicate rows.

Checking for duplicate rows (same lyrics at multiple places)

length(table(lyrics$lyrics)[table(lyrics$lyrics) > 1])
## [1] 12921

Removing duplicates

lyrics <- lyrics[!duplicated(lyrics$lyrics), ]

Checking for missing lyrics

sum(lyrics$lyrics == '')
## [1] 1

There is one song with no lyrics

Let's try to extract the lyrics.

rownames(lyrics) <- 1:nrow(lyrics)
lyrics[lyrics$lyrics == '',]
##         song year          artist genre lyrics
## 143 lemonade 2016 beyonce-knowles   Pop
lyrics$lyrics[143] <- tryCatch(genius_lyrics(artist = 'beyonce knowles', 
                                    song = 'lemonade', info = 'simple'),
                               error = function(x){return('')})
## Warning in request_GET(session, url): Not Found (HTTP 404).

The lyrics are not available in the database so we will remove this row.

lyrics <- lyrics[-143,]
sum(lyrics$lyrics == '')
## [1] 0

No more missing lyrics

length(table(lyrics$lyrics)[table(lyrics$lyrics) > 1])
## [1] 0

No more duplicates

Final sanity check

str(lyrics)
## 'data.frame':    229598 obs. of  5 variables:
##  $ song  : Factor w/ 236867 levels "0-0","0-0-0",..: 56272 205225 85347 233083 23435 8810 146916 219754 178029 228598 ...
##  $ year  : int  2009 2009 2009 2009 2009 2009 2009 2009 2009 2009 ...
##  $ artist: Factor w/ 17088 levels "009-sound-system",..: 4499 4499 4499 4499 4499 4499 4499 4499 4499 4499 ...
##  $ genre : Factor w/ 12 levels "Country","Electronic",..: 10 10 10 10 10 10 10 10 10 10 ...
##  $ lyrics: Factor w/ 229599 levels "","\003Its what youre afraid of.\nAll of my fears,\nAll of my Faults.\nAll that came first,\nAll will be lost.",..: 141437 150816 108359 142333 149557 92450 187210 196954 16818 136101 ...
apply(lyrics, 2, percentmiss)
##   song   year artist  genre lyrics 
##      0      0      0      0      0

No missing data

levels(lyrics$genre)
##  [1] "Country"       "Electronic"    "Folk"          "Hip-Hop"      
##  [5] "Indie"         "Jazz"          "Metal"         "Not Available"
##  [9] "Other"         "Pop"           "R&B"           "Rock"

We have some songs for which we have no genre information. The best thing to do is to classify them as Other.

lyrics$genre[lyrics$genre == 'Not Available'] <- 'Other'
lyrics$genre <- factor(lyrics$genre)
levels(lyrics$genre)
##  [1] "Country"    "Electronic" "Folk"       "Hip-Hop"    "Indie"     
##  [6] "Jazz"       "Metal"      "Other"      "Pop"        "R&B"       
## [11] "Rock"
ggplot(data = lyrics, aes(x = genre)) + geom_bar()

Most of the songs are Rock songs.

Processing data

Converting the data into a tibble

lyrics_tibble <- lyrics
lyrics_tibble$lyrics <- as.character(lyrics_tibble$lyrics) 
lyrics_tibble$song <- as.character(lyrics_tibble$song)
lyrics_tibble$artist <- as.character(lyrics_tibble$artist)
lyrics_tibble$genre <- as.character(lyrics_tibble$genre)
lyrics_tibble <- as_tibble(lyrics_tibble)
lyrics_tibble
## # A tibble: 229,598 x 5
##    song           year artist     genre lyrics                             
##    <chr>         <int> <chr>      <chr> <chr>                              
##  1 ego-remix      2009 beyonce-k~ Pop   "Oh baby, how you doing?\nYou know~
##  2 then-tell-me   2009 beyonce-k~ Pop   "playin' everything so easy,\nit's~
##  3 honesty        2009 beyonce-k~ Pop   "If you search\nFor tenderness\nIt~
##  4 you-are-my-r~  2009 beyonce-k~ Pop   "Oh oh oh I, oh oh oh I\n[Verse 1:~
##  5 black-culture  2009 beyonce-k~ Pop   "Party the people, the people the ~
##  6 all-i-could-~  2009 beyonce-k~ Pop   "I heard\nChurch bells ringing\nI ~
##  7 once-in-a-li~  2009 beyonce-k~ Pop   "This is just another day that I w~
##  8 waiting        2009 beyonce-k~ Pop   "Waiting, waiting, waiting, waitin~
##  9 slow-love      2009 beyonce-k~ Pop   "[Verse 1:]\nI read all of the mag~
## 10 why-don-t-yo~  2009 beyonce-k~ Pop   "N-n-now, honey\nYou better sit do~
## # ... with 229,588 more rows

Creating a corpus for rock, pop and hip-hop genres

unique(lyrics_tibble$genre)
##  [1] "Pop"        "Hip-Hop"    "Other"      "Rock"       "Metal"     
##  [6] "Country"    "Jazz"       "Electronic" "Folk"       "R&B"       
## [11] "Indie"
rock_m <- create_genre(lyrics_tibble, 'Rock')
rock_freq <- create_genre(lyrics_tibble, 'Rock', frequency = T)
pop_m <- create_genre(lyrics_tibble, 'Pop')
pop_freq <- create_genre(lyrics_tibble, 'Pop', frequency = T)
hiphop_m <- create_genre(lyrics_tibble, 'Hip-Hop')
hiphop_freq <- create_genre(lyrics_tibble, 'Hip-Hop', frequency = T)

Here’s how the data looks

rock_m[1:10, 1:5]
##          Docs
## Terms     1 2 3 4 5
##   aint    3 1 1 1 4
##   alone   0 0 0 0 0
##   always  1 0 1 0 1
##   another 1 1 0 1 0
##   around  0 0 0 0 0
##   away    0 0 0 0 0
##   baby    0 0 0 0 0
##   back    0 0 0 1 0
##   bad     1 0 0 0 0
##   behind  0 0 1 0 0

Wordclouds

rock_vec <- names(rock_freq)
pop_vec <- names(pop_freq)
hiphop_vec <- names(hiphop_freq)
wordcloud(rock_vec, rock_freq, max.words = 50, colors = 'red')

wordcloud(pop_vec, pop_freq, max.words = 50, colors = 'red')

wordcloud(hiphop_vec, hiphop_freq, max.words = 50, colors = 'red')

One thing to note here is that nigga and niggas are represented separately; this is what we are trying to avoid, and thus we need stemming.
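
As a quick illustration of what the stemmer does to such variants (hypothetical word list, not output from the data set):

library(SnowballC)
# Porter stemming collapses inflected forms (plurals, -ing/-ed endings)
# to a common root, so such variants are counted together
wordStem(c("niggas", "loving", "loved", "lovers"))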

Activating the stemming option

rock_m <- create_genre(lyrics_tibble, 'Rock', stem = T)
rock_freq <- create_genre(lyrics_tibble, 'Rock', stem = T, frequency = T)
pop_m <- create_genre(lyrics_tibble, 'Pop', stem = T)
pop_freq <- create_genre(lyrics_tibble, 'Pop', stem = T, frequency = T)
hiphop_m <- create_genre(lyrics_tibble, 'Hip-Hop', stem = T)
hiphop_freq <- create_genre(lyrics_tibble, 'Hip-Hop', stem = T, frequency = T)

Recreating the wordclouds

rock_vec <- names(rock_freq)
pop_vec <- names(pop_freq)
hiphop_vec <- names(hiphop_freq)

Rock

wordcloud(rock_vec, rock_freq, max.words = 50, colors = 'red')

Clearly, rock songs appear to be mostly about someone, as indicated by words such as know, like, dont, your, now, world, heart, love, live, and feel.

In other words, rock songs are largely about someone you love.

Pop

wordcloud(pop_vec, pop_freq, max.words = 50, colors = 'red')

It is interesting to see that the most frequent words in rock and pop music are very similar to each other.

Hip-Hop

wordcloud(hiphop_vec, hiphop_freq, max.words = 50, colors = 'red')

It looks like baby, shit, man, nigga, girl, money, the b word, and the f word are the most frequently used words in hip-hop music, which suggests the genre leans toward a rap audience, as these are commonly used rap words.

Commonality clouds

Plotting a wordcloud of the lyric words common to rock and pop

rock <- lyrics[lyrics$genre == 'Rock', 'lyrics']
rock <- as.character(rock)
pop <- lyrics[lyrics$genre == 'Pop', 'lyrics']
pop <- as.character(pop)
all_rock <- paste(rock, collapse = " ")
all_pop <- paste(pop, collapse = " ")
pop_rock <- c(all_rock, all_pop)
pop_rock_source <- VectorSource(pop_rock)
pop_rock_corpus <- VCorpus(pop_rock_source)
pop_rock_clean <- clean_corpus(pop_rock_corpus)
pop_rock_tdm <- TermDocumentMatrix(pop_rock_clean)
pop_rock_tdm_non_sparse <- removeSparseTerms(pop_rock_tdm, sparse = 0.95)
pop_rock_m <- as.matrix(pop_rock_tdm)
colnames(pop_rock_m) <- c('Rock', 'Pop')
# word cloud of common words
commonality.cloud(pop_rock_m, max.words = 50, colors = 'steelblue1')

rm(rock)
rm(pop)
rm(all_rock)
rm(all_pop)
rm(pop_rock)
rm(pop_rock_source)
rm(pop_rock_corpus)
rm(pop_rock_clean)
rm(pop_rock_tdm)
rm(pop_rock_tdm_non_sparse)
rm(pop_rock_m)

We can see that know, like, don't, you're, now, world, heart, and baby are the words common to rock and pop, which supports the idea that both pop and rock are about someone you love.

Word associations

One thing to explore is what words are close to love in rock, pop, and hip-hop.

rock_dtm <- create_tdm_dtm(lyrics_tibble, 'Rock', stem = T)
pop_dtm <- create_tdm_dtm(lyrics_tibble, 'Pop', stem = T)
hiphop_dtm <- create_tdm_dtm(lyrics_tibble, 'Hip-Hop', stem = T)

Rock

associations <- findAssocs(rock_dtm, 'love', 0.1)
associations_df <- list_vect2df(associations, col2 = 'word', col3 = 'score')
ggplot(data = associations_df, aes(score, word)) +
  geom_point(size = 3) + 
  theme_gdocs()

Pop

associations <- findAssocs(pop_dtm, 'love', 0.1)
associations_df <- list_vect2df(associations, col2 = 'word', col3 = 'score')
ggplot(data = associations_df, aes(score, word)) +
  geom_point(size = 3) + 
  theme_gdocs()

Hip-Hop

associations <- findAssocs(hiphop_dtm, 'love', 0.1)
associations_df <- list_vect2df(associations, col2 = 'word', col3 = 'score')
ggplot(data = associations_df, aes(score, word)) +
  geom_point(size = 3) + 
  theme_gdocs()

In rock and pop, the words most closely associated with love are heart, need, baby, feel, and true. In hip-hop these words are also close to love, but words like girl and hate appear as well, indicating that love is not only used in the romantic sense found in rock and pop but also in a typical rapper format, where it is tied to girl and even hate.

Let's find associations for the term nigga.

associations_nigga <- findAssocs(hiphop_dtm, 'nigga', 0.2)
associations_nigga_df <- list_vect2df(associations_nigga, col2 = 'word', 
                                      col3 = 'score')
ggplot(data = associations_nigga_df, aes(score, word)) +
  geom_point(size = 3) + 
  theme_gdocs()

The s word, b word, and f word are the closest to nigga, and this makes sense, as in a rap context we very often see lyrics with s— nigga, f— nigga, or b—- nigga.

Exploring the association of words close to money in hip-hop music

associations_money <- findAssocs(hiphop_dtm, 'money', 0.1)
associations_money_df <- list_vect2df(associations_money, col2 = 'word', 
                                      col3 = 'score')
ggplot(data = associations_money_df, aes(score, word)) +
  geom_point(size = 3) + 
  theme_gdocs()

As expected, cash is the closest, but b—-, nigga, buy, bank, hundr, dirty, and dough are all words that commonly appear close to one another in rap lyrics.

Clustering

1) Rock

rock_dtm2 <- removeSparseTerms(rock_dtm, sparse = 0.75)
rock_df <- as.data.frame(as.matrix(rock_dtm2))
rock_dist <- dist(rock_df)
hc_rock <- hclust(rock_dist, method = 'ward.D2')
hc_rock
## 
## Call:
## hclust(d = rock_dist, method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 21
plot(hc_rock)

Using silhouette width to identify the optimal number of clusters
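
As a reminder of the statistic (standard definition, not specific to this data set): for each object $i$, let $a(i)$ be its average distance to the other objects in its own cluster and $b(i)$ the smallest average distance to the objects of any other cluster; then

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

The average of $s(i)$ over all objects is the average silhouette width, and we pick the number of clusters $k$ that maximizes it.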

sapply(2:11, function(x) summary(silhouette(cutree(hc_rock, k = x), rock_dist))$avg.width)
##  [1] 0.25041182 0.15706970 0.11688794 0.08058145 0.05602927 0.05468098
##  [7] 0.04706088 0.04597518 0.04255026 0.04264160

It looks like the optimal number of clusters is 2.

{plot(hc_rock, hang = -1) 
 rect.hclust(hc_rock, k = 2)}

It looks like love has its own cluster. Let's remove it and see how the clustering changes.

rock_df_nolove <- rock_df[!rownames(rock_df) %in% ('love'),]
rock_dist_nolove <- dist(rock_df_nolove)
hc_rock_nolove <- hclust(rock_dist_nolove, method = 'ward.D2')
hc_rock_nolove
## 
## Call:
## hclust(d = rock_dist_nolove, method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 20
plot(hc_rock_nolove)

sapply(2:11, function(x) summary(silhouette(cutree(hc_rock_nolove, k = x), rock_dist_nolove))$avg.width)
##  [1] 0.16492318 0.12273234 0.08461053 0.05883074 0.05741503 0.04941392
##  [7] 0.04827394 0.04467778 0.04477368 0.03732417
{plot(hc_rock_nolove, hang = -1) 
 rect.hclust(hc_rock_nolove, k = 2)}

Now dont and know have their own cluster. We saw from the wordcloud that love, dont, and know are among the most common words in rock lyrics, and many songs have lines like I dont know how, I dont want to know, or If you didn't know. Thanks to the removal of stopwords and the use of stemming, we are able to see that.

2) Hip-Hop

hiphop_dtm2 <- removeSparseTerms(hiphop_dtm, sparse = 0.60)
hiphop_df <- as.data.frame(as.matrix(hiphop_dtm2))
hiphop_dist <- dist(hiphop_df)
hc_hiphop <- hclust(hiphop_dist, method = 'ward.D2')
hc_hiphop
## 
## Call:
## hclust(d = hiphop_dist, method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 28
plot(hc_hiphop)

sapply(2:11, function(x) summary(silhouette(cutree(hc_hiphop, k = x), hiphop_dist))$avg.width)
##  [1] 0.31038093 0.29377729 0.29020628 0.17038019 0.17187303 0.07800756
##  [7] 0.07945610 0.08239001 0.08342999 0.06204088
{plot(hc_hiphop, hang = -1) 
 rect.hclust(hc_hiphop, k = 2)}

Nigga, get, and like are in their own cluster, while other rap words, including f— and b—-, are in another cluster. Again, that is consistent with what our wordclouds told us earlier. What is interesting is the use of love in hip-hop. For rock, love had its own cluster and was closer to your, get, and like. Here love is in the same cluster as aint, f—, and shit, which indicates that it is used in different contexts in the two forms of music.

Sentiment analysis for the hip-hop genre

Is there a relationship between different sentiments and time?

Extracting the sentiments

lyrics_tibble_hiphop <- filter(lyrics_tibble, genre == 'Hip-Hop')

tidy_lyrics <- lyrics_tibble_hiphop %>%
  unnest_tokens(word, lyrics)

# word totals for each song
totals_hiphop <- tidy_lyrics %>%
  count(song) %>%
  rename(total_words = n)

lyric_counts <- tidy_lyrics %>%
  left_join(totals_hiphop, by = 'song')

lyric_sentiment <- lyric_counts %>%
  inner_join(get_sentiments('nrc'))
## Joining, by = "word"

How many words of each sentiment does each song have?

lyric_sentiment %>%
  count(song, sentiment, sort = TRUE)
## # A tibble: 167,826 x 3
##    song         sentiment     n
##    <chr>        <chr>     <int>
##  1 bang-bang    negative    500
##  2 intro        negative    483
##  3 bang-bang    anger       457
##  4 bang-bang    disgust     420
##  5 bang-bang    fear        420
##  6 money        positive    420
##  7 intro        positive    409
##  8 rap-monument negative    392
##  9 bang-bang    sadness     381
## 10 bang-bang    surprise    373
## # ... with 167,816 more rows

The most negative songs

lyric_sentiment %>%
  # Count using three arguments
  count(song, sentiment, total_words) %>%
  ungroup() %>%
  # Make a new percent column with mutate 
  mutate(percent = n/total_words) %>%
  # Filter for only negative words
  filter(sentiment == 'negative') %>%
  # Arrange by descending percent
  arrange(desc(percent))
## # A tibble: 17,388 x 5
##    song                    sentiment total_words     n percent
##    <chr>                   <chr>           <int> <int>   <dbl>
##  1 boy-oh-boy-thugli-remix negative           47    19   0.404
##  2 riot-fight              negative           81    30   0.370
##  3 where                   negative           17     6   0.353
##  4 mud-digger              negative            9     3   0.333
##  5 charlie-manson          negative          423   138   0.326
##  6 shaky-shaky-remix       negative          605   171   0.283
##  7 red-opps                negative          447   121   0.271
##  8 where-we-from           negative           45    12   0.267
##  9 rock-shyt               negative            4     1   0.25 
## 10 mo-thug-interlude       negative          229    55   0.240
## # ... with 17,378 more rows

The most positive songs

lyric_sentiment %>%
  count(song, sentiment, total_words) %>%
  ungroup() %>%
  mutate(percent = n/total_words) %>%
  filter(sentiment == 'positive') %>%
  arrange(desc(percent))
## # A tibble: 17,419 x 5
##    song                          sentiment total_words     n percent
##    <chr>                         <chr>           <int> <int>   <dbl>
##  1 holy-god                      positive          153    94   0.614
##  2 agnus-dei                     positive           76    32   0.421
##  3 we-found-love-remix           positive           10     4   0.4  
##  4 triune-god                    positive          144    49   0.340
##  5 shattered                     positive          248    83   0.335
##  6 hot-metal                     positive           12     4   0.333
##  7 packet-prelude                positive            6     2   0.333
##  8 so-real                       positive            3     1   0.333
##  9 praise-his-holy-name          positive           72    23   0.319
## 10 o-christmas-tree-o-tannenbaum positive           97    30   0.309
## # ... with 17,409 more rows

All the sentiments

unique(lyric_sentiment$sentiment)
##  [1] "positive"     "anger"        "negative"     "disgust"     
##  [5] "fear"         "sadness"      "anticipation" "joy"         
##  [9] "surprise"     "trust"

Let's look at how songs have evolved over time. First, look at all the unique years.

unique(lyric_sentiment$year)
##  [1] 2007 1998 2006 2002 1995 2009 2010 2012 2015 2014 2013 2011 2008 2016
## [15] 2004 2005 2003 1992  702 1989 1996 1999 1994 2001 2000  112 1991 1990
## [29] 1982 1993 1997

We can see that there is clearly a mistake: two of the years are 112 and 702. Unless we are talking about the early centuries AD, that is not possible.

Let's get the songs for these years.

faulty_years <- lyric_sentiment %>%
                  filter(year == 702 | year == 112) 

song_faulty <- faulty_years %>% group_by(song, artist) %>%
  summarise(count = n())

song_faulty
## # A tibble: 4 x 3
## # Groups:   song [4]
##   song                artist    count
##   <chr>               <chr>     <int>
## 1 anywhere-remix      dru-hill     96
## 2 come-see-me-remix   black-rob   103
## 3 it-s-over-now-remix g-dep       103
## 4 star                clipse       50

Now that we know the song names and artists, it is easy to find the years. We can look them up on Google or use the gen_song_url function; we will do both.

Song URLs through the genius package (see reference 4 for the Google links).

for(i in 1:nrow(song_faulty)){
  print(gen_song_url(artist = song_faulty$artist[i], song = song_faulty$song[i]))
}
## [1] "https://genius.com/dru-hill-anywhere-remix-lyrics"
## [1] "https://genius.com/black-rob-come-see-me-remix-lyrics"
## [1] "https://genius.com/g-dep-it-s-over-now-remix-lyrics"
## [1] "https://genius.com/clipse-star-lyrics"

anywhere-remix by dru-hill was composed in 1999 (ref 4). come-see-me-remix by black-rob was composed in 1996 (ref 4). it-s-over-now-remix by g-dep was composed in 2001 (ref 4). star by clipse was composed in 2002 (ref 4).

Correcting the years

lyric_sentiment[lyric_sentiment$song == 'anywhere-remix', ]$year <- 1999
lyric_sentiment[lyric_sentiment$song == 'come-see-me-remix', ]$year <- 1996
lyric_sentiment[lyric_sentiment$song == 'it-s-over-now-remix', ]$year <- 2001
lyric_sentiment[lyric_sentiment$song == 'star', ]$year <- 2002

Checking again

unique(lyric_sentiment$year)
##  [1] 2007 1998 2006 2002 1995 2009 2010 2012 2015 2014 2013 2011 2008 2016
## [15] 2004 2005 2003 1992 1989 1996 1999 1994 2001 2000 1991 1990 1982 1993
## [29] 1997

Looks good now.

The trend in the emotional content of hip-hop lyrics from the early 80s to 2016

What emotions are most represented in hip-hop?

# Creating a copy of the file so that if we go wrong we have the original 
new_hip_hop <- lyric_sentiment
new_hip_hop %>%
  ggplot(aes(x = factor(sentiment))) +
  geom_bar()

Mostly negative and positive sentiments.

Positive sentiments

new_hip_hop %>%
  filter(sentiment == 'positive') %>%
  count(song, year, total_words) %>%
  ungroup() %>%
  # proportion of positive words per song; bin years into decades
  mutate(percent = n / total_words,
         year = 10 * floor(year / 10)) %>%
  ggplot(aes(x = factor(year), y = percent)) +
  geom_boxplot()

From the 80s until now, the trend in positive words does not appear to have changed much. Hip-hop lyrics appear to have used positive words at a consistent rate from the 80s until today.

Negative words

new_hip_hop %>%
  filter(sentiment == 'negative') %>%
  count(song, year, total_words) %>%
  ungroup() %>%
  mutate(percent = n / total_words,
         year = 10 * floor(year / 10)) %>%
  ggplot(aes(x = factor(year), y = percent)) +
  geom_boxplot()

This is interesting. There was a higher usage of negative words in the 1980s compared to the 1990s, and it has been consistent since then, so the usage of negative words appears to have decreased over time, at least when comparing the 80s to the 90s.

Fear

new_hip_hop %>%
  filter(sentiment == 'fear') %>%
  count(song, year, total_words) %>%
  ungroup() %>%
  mutate(percent = n / total_words,
         year = 10 * floor(year / 10)) %>%
  ggplot(aes(x = factor(year), y = percent)) +
  geom_boxplot()

The usage of fearful words shows a right skew, with a concentration of points above the median. Again, as with negative words, fearful words appear to have decreased between the 80s and the 90s and to have been stable since then.

Joyful words

new_hip_hop %>%
  filter(sentiment == 'joy') %>%
  count(song, year, total_words) %>%
  ungroup() %>%
  mutate(percent = n / total_words,
         year = 10 * floor(year / 10)) %>%
  ggplot(aes(x = factor(year), y = percent)) +
  geom_boxplot()

Again, there appears to be some difference between the 80s and the 90s in the usage of joyful words.

Let's model these relationships.

Positive

pos_by_year <- lyric_sentiment %>%
  # Filter for positive words
  filter(sentiment == 'positive') %>%
  count(song, year, total_words) %>%
  ungroup() %>%
  # Define a new column: percent
  mutate(percent = n/total_words)

model_pos_emo <- lm(percent ~ year, data = pos_by_year)
summary(model_pos_emo)
## 
## Call:
## lm(formula = percent ~ year, data = pos_by_year)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.03998 -0.01683 -0.00383  0.01065  0.57485 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  1.932e-01  9.758e-02   1.980   0.0477 *
## year        -7.653e-05  4.858e-05  -1.575   0.1152  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02778 on 19146 degrees of freedom
## Multiple R-squared:  0.0001296,  Adjusted R-squared:  7.739e-05 
## F-statistic: 2.482 on 1 and 19146 DF,  p-value: 0.1152

As expected, year does not have a significant relationship with the proportion of positive words in hip-hop lyrics.

Negative

neg_by_year <- lyric_sentiment %>%
  # Filter for negative words
  filter(sentiment == 'negative') %>%
  count(song, year, total_words) %>%
  ungroup() %>%
  mutate(percent = n/total_words)

model_neg_emo <- lm(percent ~ year, data = neg_by_year)
summary(model_neg_emo)
## 
## Call:
## lm(formula = percent ~ year, data = neg_by_year)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.04481 -0.01822 -0.00302  0.01320  0.36474 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.983e-01  9.290e-02   7.517 5.85e-14 ***
## year        -3.271e-04  4.625e-05  -7.073 1.56e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02636 on 19090 degrees of freedom
## Multiple R-squared:  0.002614,   Adjusted R-squared:  0.002562 
## F-statistic: 50.03 on 1 and 19090 DF,  p-value: 1.564e-12

Negative words, however, have a significant relationship with year, indicating that the use of negative words in hip-hop lyrics has decreased over time from the 80s until now.

Fear

fear_by_year <- lyric_sentiment %>%
  # Filter for fearful words
  filter(sentiment == 'fear') %>%
  count(song, year, total_words) %>%
  ungroup() %>%
  mutate(percent = n/total_words)

model_fear_emo <- lm(percent ~ year, data = fear_by_year)
summary(model_fear_emo)
## 
## Call:
## lm(formula = percent ~ year, data = fear_by_year)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.02335 -0.01196 -0.00372  0.00696  0.47847 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.2488931  0.0666856   3.732 0.000190 ***
## year        -0.0001128  0.0000332  -3.397 0.000682 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0186 on 18537 degrees of freedom
## Multiple R-squared:  0.0006222,  Adjusted R-squared:  0.0005683 
## F-statistic: 11.54 on 1 and 18537 DF,  p-value: 0.0006822

Again, there is a significant relationship between the proportion of fearful words and year, indicating that the use of fearful words has decreased over time.

Joy

joy_by_year <- lyric_sentiment %>%
  # Filter for joyful words
  filter(sentiment == 'joy') %>%
  count(song, year, total_words) %>%
  ungroup() %>%
  mutate(percent = n/total_words)

model_joy_emo <- lm(percent ~ year, data = joy_by_year)
summary(model_joy_emo)
## 
## Call:
## lm(formula = percent ~ year, data = joy_by_year)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.02307 -0.01193 -0.00511  0.00477  0.37794 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.4942009  0.0757234  -6.526 6.91e-11 ***
## year         0.0002567  0.0000377   6.810 1.01e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.021 on 18367 degrees of freedom
## Multiple R-squared:  0.002519,   Adjusted R-squared:  0.002464 
## F-statistic: 46.38 on 1 and 18367 DF,  p-value: 1.005e-11

The coefficient for year here is positive and significant, indicating that, unlike negative and fearful words, the use of joyful words has actually increased over time.

We have looked at the sentiments individually so far, but what if we combine them and check whether the usage of positive or joyful words in hip-hop lyrics has changed, compared with fearful and negative words?

Positive and joyful

posjoy_by_year <- lyric_sentiment %>%
  # Filter for positive and joyful words
  filter(sentiment == 'positive' | sentiment == 'joy') %>%
  count(song, year, total_words) %>%
  ungroup() %>%
  mutate(percent = n/total_words)

model_posjoy_emo <- lm(percent ~ year, data = posjoy_by_year)
summary(model_posjoy_emo)
## 
## Call:
## lm(formula = percent ~ year, data = posjoy_by_year)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.06034 -0.02725 -0.00813  0.01502  0.73962 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -0.2144800  0.1633071  -1.313   0.1891  
## year         0.0001367  0.0000813   1.681   0.0927 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0465 on 19149 degrees of freedom
## Multiple R-squared:  0.0001476,  Adjusted R-squared:  9.537e-05 
## F-statistic: 2.827 on 1 and 19149 DF,  p-value: 0.09274

There isn’t a significant relationship between the usage of positive and joyful words with year.

Negative and fearful

negfear_by_year <- lyric_sentiment %>%
  # Filter for negative and fearful words
  filter(sentiment == 'negative' | sentiment == 'fear') %>%
  count(song, year, total_words) %>%
  ungroup() %>%
  mutate(percent = n/total_words)

model_negfear_emo <- lm(percent ~ year, data = negfear_by_year)
summary(model_negfear_emo)
## 
## Call:
## lm(formula = percent ~ year, data = negfear_by_year)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.06781 -0.02815 -0.00549  0.01972  0.65863 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.030e+00  1.458e-01   7.063 1.68e-12 ***
## year        -4.814e-04  7.256e-05  -6.634 3.37e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04156 on 19175 degrees of freedom
## Multiple R-squared:  0.00229,    Adjusted R-squared:  0.002238 
## F-statistic:    44 on 1 and 19175 DF,  p-value: 3.365e-11

Again negative and fearful words have decreased over time.

Semantic Vector Spaces

Now that we have modelled and seen that the use of words carrying certain sentiments, such as negative and fearful, has changed over time in hip-hop lyrics, let's try to understand how the relationships between words pan out, especially when we weight word frequencies using tf-idf.

tf-idf transformation

What is tf-idf?

knitr::include_graphics("Capture.PNG")

This kind of transformation gives more weight to words that occur frequently in only some documents and less weight to words that occur frequently in every document.
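
For reference, the standard form of the weighting is

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log\frac{N}{\mathrm{df}(t)}$$

where $\mathrm{tf}(t, d)$ is how often term $t$ appears in document (lyric) $d$, $N$ is the total number of documents, and $\mathrm{df}(t)$ is the number of documents containing $t$. The lw_logtf() and gw_idf() calls below apply a log-scaled variant of this local and global weighting from the lsa package.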

hiphop_weight <- lw_logtf(hiphop_m) * gw_idf(hiphop_m)
hiphop_weight[1:10,1:5]
##          Docs
## Terms             1        2        3        4 5
##   act      0.000000 0.000000 5.614559 0.000000 0
##   aint     2.874354 0.000000 0.000000 3.199968 0
##   air      0.000000 0.000000 0.000000 0.000000 0
##   alon     0.000000 0.000000 0.000000 0.000000 0
##   alreadi 10.379107 3.459702 0.000000 0.000000 0
##   alway    0.000000 0.000000 0.000000 6.309563 0
##   anoth    0.000000 0.000000 0.000000 2.556010 0
##   anyth    0.000000 0.000000 0.000000 0.000000 0
##   arm      0.000000 0.000000 0.000000 0.000000 0
##   around   0.000000 2.038425 0.000000 0.000000 0

LSA or singular value decomposition

Now we want to know the likelihood of a word being part of a certain lyric.
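
As a brief reminder of what lsa() computes (standard LSA formulation; the number of retained dimensions $k$ is chosen automatically by the package by default), the weighted term-document matrix $A$ is factored by a truncated singular value decomposition,

$$A \approx T_k\, S_k\, D_k^{\top}$$

where the rows of $T_k$ place each term in a $k$-dimensional semantic space, $S_k$ is a diagonal matrix of singular values, and the rows of $D_k$ place each document (lyric) in the same space. The tk matrix printed below corresponds to $T_k$.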

hiphop_lsa <- lsa(hiphop_weight)
hiphop_lsa$tk[1:20,1:5]
##                [,1]         [,2]          [,3]         [,4]          [,5]
## act     -0.03480244  0.008100931 -0.0117173074  0.021668166 -1.645902e-02
## aint    -0.11087805  0.049291005 -0.0574075158 -0.081649567  4.034618e-03
## air     -0.02466065 -0.001073388 -0.0020002633  0.036047161 -1.537891e-02
## alon    -0.02425898 -0.058879005  0.0025335214 -0.036290013  3.527319e-02
## alreadi -0.02099539 -0.009305805 -0.0221289490 -0.012966318 -1.474174e-02
## alway   -0.04400585 -0.054321844  0.0094484544 -0.039458815 -1.153625e-02
## anoth   -0.04141720 -0.025157073  0.0279065097 -0.009862423 -1.041954e-02
## anyth   -0.01898343 -0.027999761 -0.0137310848 -0.020928304 -7.026555e-05
## arm     -0.01818028 -0.003579844  0.0227906092  0.014086706 -2.284745e-03
## around  -0.05772882 -0.002707595 -0.0100973361  0.007742271  6.688059e-03
## ask     -0.04220455 -0.008738280 -0.0006345552  0.003754670 -3.964757e-02
## ass     -0.06959866  0.122095289 -0.0345856885  0.041283926  5.135187e-02
## away    -0.03789859 -0.075999214  0.0354763858 -0.055598721  4.191956e-02
## babi    -0.07387702 -0.103498801 -0.2395487446 -0.010093611  1.820231e-01
## back    -0.09117613  0.008310097  0.0086951455  0.045443585 -1.356482e-02
## bad     -0.04442159  0.005976440 -0.0311681246 -0.007291067 -3.513891e-02
## bag     -0.02596100  0.029919505 -0.0205709832  0.011840158 -6.211249e-02
## ball    -0.03168073  0.038758049 -0.0205514473  0.001686696 -5.190880e-02
## bang    -0.02634671  0.048980106  0.0062113353  0.011313392  2.193570e-02
## bank    -0.01920501  0.029773050 -0.0178481910 -0.006116624 -4.013172e-02

Here we can say, for example, that bank is more likely to occur in lyric 2 and less likely in lyrics 1, 3, 4, and 5, while air is more likely to occur in lyric 4 than in lyrics 1, 2, 3, and 5. This matrix lets us see which words are likely to be part of the same lyrics and thus likely to be close to one another. Using tf-idf and LSA and then finding neighboring words gives better results than using raw frequencies and plotting associations, because tf-idf gives higher weights to words concentrated in some lyrics and lowers the weights of words likely to occur in every lyric. Applying LSA to the tf-idf matrix then estimates how strongly each word is tied to each lyric, so words tied to the same lyrics end up closer to one another. For example, b–g, ball, bank, and bad are all likely to be part of lyric 2.

Visualizing the nearest neighboring words by converting the lsa matrix to a text matrix

hiphop_lsa <- as.textmatrix(hiphop_lsa)

Does hip-hop actually need explicit words to sound better? What about words like god, love, and fight? It would be useful to know in what contexts these words are used in rock and hip-hop.

We saw the associations with love earlier; now let's look at the nearest neighbors of the word love.

plot_neighbors('love', n = 10, tvectors = hiphop_lsa, method = 'MDS',
               dims = 2)

##                 x            y
## love  0.392197538  0.007502624
## know  0.001252845  0.034051261
## just -0.003667512  0.021808355
## dont -0.032886954  0.011014319
## make -0.019187902 -0.167634249
## like -0.054413510 -0.018198256
## now  -0.057300014  0.027333658
## got  -0.069406422  0.003463801
## caus -0.066310244  0.097571780
## get  -0.090277827 -0.016913293

We can see that, even though they are not as close as one might normally want, words like make, cause, just, and get are the nearest neighbors of love. This makes sense, as the use of love in hip-hop songs is different from the way it is used in rock and pop songs, where it carries a more romantic sense.

Let's check that by looking at the use of love in rock music.

rock_weight <- lw_logtf(rock_m) * gw_idf(rock_m)
rock_lsa <- lsa(rock_weight)
rock_lsa <- as.textmatrix(rock_lsa)
plot_neighbors('love', n = 10, tvectors = rock_lsa, method = 'MDS',
               dims = 2)

##                 x           y
## love  -0.22638180 -0.03903309
## true  -0.10177218  0.07500461
## kiss  -0.17761713 -0.26971970
## sure   0.22161890 -0.03050484
## know   0.12411929 -0.03216807
## just   0.13655588 -0.09504119
## made   0.13788701  0.14327465
## smile  0.08054694 -0.23116227
## heart -0.35746376  0.20080408
## found  0.16250686  0.27854581

Here we can see that love is surrounded by kiss, true, heart, smile, and found, which indicates that love is used in a more romantic sense in rock music, compared with the more rapper-like way it is used in hip-hop, where it is more of a slang term than a conveyance of feeling.

Let's look at the use of the word god.

Hip-Hop

plot_neighbors('god', n = 10, tvectors = hiphop_lsa, method = 'MDS',
               dims = 2)

##                 x           y
## god   -0.20636151 -0.08765558
## bless -0.27632179 -0.05005261
## thank -0.29835300 -0.17231468
## pray  -0.25695484 -0.01435238
## save  -0.05194345  0.32872585
## power  0.05176251  0.34657782
## swear  0.25265300 -0.40517210
## see    0.24004275  0.04081413
## like   0.28985716  0.03666967
## just   0.25561916 -0.02324012

As expected, pray, bless, thank, save, power, and swear are the words closest to god.

Rock

plot_neighbors('god', n = 10, tvectors = rock_lsa, method = 'MDS',
               dims = 2)

##                x           y
## god   0.03008374 -0.20112406
## soul  0.03538635 -0.02188907
## hell  0.28507869 -0.07502462
## name -0.11188762 -0.37839246
## hope -0.07976144  0.24723374
## must -0.11745096  0.01426231
## made -0.14684025 -0.03124707
## help -0.30477211  0.15650035
## fear  0.04271459  0.17264530
## dead  0.36744901  0.11703558

Fear, soul, hope, hell, help, and dead are the words surrounding god in rock music, which indicates that the word god here is used more in the context of a person's soulmate rather than generally for prayers and blessings as in hip-hop music.

Now let's look at the word fight.

Hip-Hop

plot_neighbors('fight', n = 10, tvectors = hiphop_lsa, method = 'MDS',
               dims = 2)

##                 x           y
## fight  0.36371805 -0.30704761
## dont  -0.08884576  0.01431749
## back  -0.10621436 -0.03111350
## get   -0.07724950  0.05871461
## power  0.45401446  0.24348586
## right -0.12914871 -0.20502861
## just  -0.10095056  0.04593225
## like  -0.07859474  0.07558688
## caus  -0.10794463  0.06123524
## know  -0.12878425  0.04391740

Rock

plot_neighbors('fight', n = 10, tvectors = rock_lsa, method = 'MDS',
               dims = 2)

##                  x            y
## fight  0.011616849 -0.118284455
## side  -0.003458095 -0.068949821
## stand  0.258603923 -0.073485971
## right -0.339230686  0.005197094
## line   0.155664208  0.105540440
## fear   0.213181366 -0.342559723
## work   0.131140913  0.307133393
## what  -0.161678221  0.026009596
## wrong -0.405077866 -0.037173196
## put    0.139237610  0.196572641

The word fight is surrounded by power, back, and dont in hip-hop music, indicating that it is used in its literal sense of fighting with someone, as in fight back or dont fight. In rock music, however, fight is surrounded by side, stand, line, right, and wrong, which indicates that fight is used less literally here, in the sense of fighting for someone or having them stand by your side.

We have now seen how words like love, god, and fight are used in different contexts in two different genres of music.

Let's look at the cosine similarities between these words in both of our vector spaces.
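
For two word vectors $\mathbf{a}$ and $\mathbf{b}$ in the LSA space, cosine similarity is the standard

$$\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert\, \lVert \mathbf{b} \rVert}$$

which is what multicos() reports: values near 1 mean the words point in nearly the same direction in the semantic space, while values near 0 mean they are largely unrelated.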

list1 <- c('god', 'love', 'fight')
print('Hip-Hop')
## [1] "Hip-Hop"
multicos(list1, tvectors = hiphop_lsa)
##             god      love     fight
## god   1.0000000 0.2754689 0.2501409
## love  0.2754689 1.0000000 0.2630621
## fight 0.2501409 0.2630621 1.0000000
print('Rock')
## [1] "Rock"
multicos(list1, tvectors = rock_lsa)
##             god      love     fight
## god   1.0000000 0.1807514 0.1911263
## love  0.1807514 1.0000000 0.2385054
## fight 0.1911263 0.2385054 1.0000000

We can see that in hip-hop music god has a higher cosine similarity with love than fight whereas in rock music fight has a higher cosine similarity with love than god.

Interpret and Discuss

Summarize the results from your study in as plain a language as possible. How does this relate to previous literature? Were the results supportive of your hypotheses? What have we learned from you doing this analysis/study?

The wordclouds showed how much overlap there is between rock and pop music and how similar these genres are in the words used in their lyrics. We also saw how the usage of certain words has changed over time, especially compared with the 80s: in the 80s there was a much higher frequency of negative and fearful words than there is now, with the decline starting in the 90s. We also saw how words like god, love, and fight are surrounded by different sets of words in the rock and hip-hop genres. Love was used to convey feelings in rock music, whereas in hip-hop it appeared alongside other hip-hop slang. God was used in its everyday sense, alongside prayer and blessings, in the hip-hop genre, whereas in the rock genre it appeared with soul and hope, suggesting it was being used to convey feelings toward a person. Our analysis thus largely supported the theories stated in our problem statement.

Many previous studies of music have focused on sentiment analysis, and many have explored sentiments further than we have, by measuring sentiment scores and extracting the top songs representing each sentiment. However, we have taken a slightly different route from traditional music analytics studies, as we have also tried to see how the same words can be represented differently in two different genres of music.

There are definitely many other things to explore with this data, such as how the proportions of sentiments changed over time for each genre. That could tell us which genres use more or fewer negative or positive words now than they did before. Beyond sentiment, individual words such as the n—- word could be explored further to see how their use has changed over time, which artists have used them most often over the years, and how that has changed.

References

Include your references in APA style.

  1. Getting information about different music genres: “The Genealogy and History of Popular Music Genres.” Musicmap, musicmap.info/.

  2. Rock vs Pop: Kivumbi. “Difference Between.” Difference Between Similar Terms and Objects, 20 Feb. 2011, www.differencebetween.net/miscellaneous/difference-between-rock-and-pop/.

  3. Dataset: Gyanendra Mishra. “380,000+ Lyrics from MetroLyrics.” Kaggle, 11 Jan. 2017, www.kaggle.com/gyani95/380000-lyrics-from-metrolyrics.

  4. Song year:

  • any-where remix dru-hill “Shyne Discography.” Wikipedia, Wikimedia Foundation, 17 May 2018, en.wikipedia.org/wiki/Shyne_discography.

  • come-see-me-remix black rob “112 (Ft. Black Rob) – Come See Me (Remix).” Genius, 21 Oct. 1996, genius.com/112-come-see-me-remix-lyrics.

  • it-s-over-now-remix g-dep “112 - It’s Over Now (Remixes).” Discogs, 1 Jan. 1970, www.discogs.com/112-Its-Over-Now-Remixes/release/1637082.

  • star clipse “702 (Ft. Clipse) – Star.” Genius, 10 Dec. 2002, genius.com/702-star-lyrics.