This project should allow you to apply the information you’ve learned in the course to a new dataset. While the structure of the final project will be more of a research project, you can use this knowledge to appropriately answer questions in all fields, along with the practical skills of writing a report that others can read.
The final document should be a knitted HTML/PDF/Word document from a Markdown file. You will turn in the knitted document along with your .Rmd. Be sure to spell and grammar check your work! The following sections should be included:
Introduce your research topic. What is the background knowledge that someone would need to understand the field or area that you have decided to investigate? In this section, you should include sources that help explain the background area and cite them in APA style. 5-10 articles across the paper would be appropriate.
We all like listening to music, and we all have different tastes: some people prefer rock, some prefer hip-hop, some prefer country. But have you ever given a thought to the intrinsic elements of a music genre? Have you ever wondered how the same word can be used differently in two different genres? Have you ever wondered how the sentiments expressed in music have changed over time, for example from what music was in the 80s and 90s to what it is now? Music has long been a way to convey messages, a symbol of peace, and a tool to relax the mind, and those roles have shifted over time.
This is exactly what our study examines. We will show how words are represented differently in two different genres, how the use of certain sentiments has changed over time, and how a particular word is closely associated with different words in different genres of music. To learn more about different genres of music, refer to reference 1.
What is the data that you are using for you project? What is your hypothesis as to the outcome of the analysis? Why is the problem important for us to study or answer?
This data contains 339277 observations and five useful variables: song name, year, artist, genre, and lyrics. Year, the only numeric variable, ranges from 1982 to 2016 with a few outliers that will be addressed later during data cleansing. The Genre variable has 11 substantive categories: Country, Electronic, Folk, Hip-Hop, Indie, Jazz, Metal, Pop, R&B, Rock, and Other (songs with no genre label are folded into Other during cleaning).
Based on our understanding of music, our assumption is that rock and pop have a high degree of overlap in the words used in their lyrics, since pop is generally considered a softer alternative to rock (ref 2). Our theory is that rock and pop lyrics are mostly about love, with one person conveying a feeling to another. Secondly, we believe that words with deeper meanings, such as love and god, are used as slang by hip-hop artists, whereas the same words carry a much deeper meaning in rock music. We will leverage text mining and sentiment analysis to examine these theories. We also believe that words expressing negativity are used less now than they were in the 80s.
This problem may not demand an answer in the way some more prominent research questions do, but it definitely helps us understand how words from a language can be used in different contexts in different places. This study also illustrates the value of a semantic vector space over association measures computed from a raw frequency matrix.
Explain the statistical analysis that you are using - you can assume some statistical background, but not to the specific design you are mentioning. For example, the person would know what a mean is, but not Lexeme Analysis.
Analysis will be performed using the following concepts from this class (a brief, hypothetical sketch of the regression step follows this list):
1) Regression Analysis
2) Hierarchical Clustering
3) Sentiment Analytics
4) Semantic Vector Spaces
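To make the regression component concrete before the full analysis, here is a minimal, hypothetical sketch (toy data, placeholder column names) of the kind of model fit later: the share of words in a song carrying a given sentiment regressed on the year of release.
# Hypothetical sketch of the planned regression: sentiment_share and year are
# placeholder names for the per-song sentiment proportion and release year.
toy <- data.frame(
  year            = c(1985, 1992, 1999, 2006, 2013),
  sentiment_share = c(0.14, 0.12, 0.11, 0.10, 0.09)
)
summary(lm(sentiment_share ~ year, data = toy))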
Explain the data you have selected to study. You can find data through many available corpora or other datasets online (ask for help here for sure!). How was the data collected? Who/what is in the data? Identify what the independent and dependent variables are for the analysis. How do these independent and dependent variables fit into the analyses you selected?
This is a Kaggle dataset (ref 3). It was collected by Gyanendra Mishra, who scraped the lyrics from the web using a crawler.
Description of each variable:
song: the name of the song
year: the year in which it was composed
artist: who sang the song
genre: what genre of music it belongs to, e.g. pop, country, or hip-hop
lyrics: the actual words of the song
For the regression, the dependent variable will be created later in the sentiment analysis phase (more details appear in the sentiment analytics section of the results). Briefly, the dependent variable will be the proportion of words carrying a given sentiment out of the total words in a song, and the independent variable will be year, to test whether the use of words representing a certain sentiment has changed over time.
For most of the analysis we will use the lyrics to create our corpus and for text mining. Song name and artist will be used to fill in missing lyrics or to extract a song's URL, as you will see later. Note that the uses of song and artist are not limited to what was mentioned above; there are many other things that could be done with these variables, but for the purpose of our analysis and to test our theories we won't need them.
Analyze the data given your statistical plan. Report the appropriate statistics for that analysis (see lecture notes). Include figures! Include the R-chunks so we can see the analyses you ran and output from the study. Note what you are doing in each step.
Loading the required packages
library(genius)
library(qdap)
## Loading required package: qdapDictionaries
## Loading required package: qdapRegex
## Loading required package: qdapTools
## Loading required package: RColorBrewer
##
## Attaching package: 'qdap'
## The following object is masked from 'package:base':
##
## Filter
library(tm)
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:qdap':
##
## ngrams
##
## Attaching package: 'tm'
## The following objects are masked from 'package:qdap':
##
## as.DocumentTermMatrix, as.TermDocumentMatrix
library(tidyverse)
## -- Attaching packages -------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0 v purrr 0.3.2
## v tibble 2.1.1 v dplyr 0.8.0.1
## v tidyr 0.8.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ----------------------------------------------- tidyverse_conflicts() --
## x ggplot2::%+%() masks qdapRegex::%+%()
## x ggplot2::annotate() masks NLP::annotate()
## x dplyr::explain() masks qdapRegex::explain()
## x dplyr::filter() masks stats::filter()
## x dplyr::id() masks qdapTools::id()
## x dplyr::lag() masks stats::lag()
library(wordcloud)
library(ggthemes)
library(dendextend)
##
## ---------------------
## Welcome to dendextend version 1.10.0
## Type citation('dendextend') for how to cite the package.
##
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
##
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## Or contact: <tal.galili@gmail.com>
##
## To suppress this message use: suppressPackageStartupMessages(library(dendextend))
## ---------------------
##
## Attaching package: 'dendextend'
## The following object is masked from 'package:qdap':
##
## %>%
## The following object is masked from 'package:stats':
##
## cutree
library(RWeka)
library(tidytext)
library(lsa)
## Loading required package: SnowballC
library(LSAfun)
## Loading required package: rgl
##
## Attaching package: 'rgl'
## The following object is masked from 'package:qdap':
##
## %>%
##
## Attaching package: 'LSAfun'
## The following object is masked from 'package:purrr':
##
## compose
library(cluster)
Getting the required functions
clean_corpus <- function(corpus){
  # Remove punctuation
  corpus <- tm_map(corpus, removePunctuation)
  # Transform to lower case
  corpus <- tm_map(corpus, content_transformer(tolower))
  # Remove stopwords
  corpus <- tm_map(corpus, removeWords, stopwords('en'))
  # Strip whitespace
  corpus <- tm_map(corpus, stripWhitespace)
  return(corpus)
}
create_genre <- function(df, genre, stem = F, tdm = T, sparse = 0.95, frequency = F){
  # Subset the lyrics for one genre and build a cleaned corpus
  df <- df[df$genre == genre, ]
  df <- select(df, lyrics)
  df <- df$lyrics
  df_source <- VectorSource(df)
  df_corpus <- VCorpus(df_source)
  clean_corp <- clean_corpus(df_corpus)
  # Optionally stem the corpus
  if(stem){
    clean_corp <- tm_map(clean_corp, stemDocument)
  }
  # Build a term-document or document-term matrix
  if(tdm){
    clean_m <- TermDocumentMatrix(clean_corp)
  } else{
    clean_m <- DocumentTermMatrix(clean_corp)
  }
  # Drop sparse terms and return either the matrix or sorted term frequencies
  clean_non_sparse <- removeSparseTerms(clean_m, sparse = sparse)
  clean_m <- as.matrix(clean_non_sparse)
  if(frequency){
    clean_m <- sort(rowSums(clean_m), decreasing = T)
  }
  return(clean_m)
}
create_tdm_dtm <- function(df, genre, stem = F, tdm = T, sparse = 0.95){
  # Same as create_genre, but returns the sparse tm matrix object itself
  df <- df[df$genre == genre, ]
  df <- select(df, lyrics)
  df <- df$lyrics
  df_source <- VectorSource(df)
  df_corpus <- VCorpus(df_source)
  clean_corp <- clean_corpus(df_corpus)
  if(stem){
    clean_corp <- tm_map(clean_corp, stemDocument)
  }
  if(tdm){
    clean_m <- TermDocumentMatrix(clean_corp)
  } else{
    clean_m <- DocumentTermMatrix(clean_corp)
  }
  clean_non_sparse <- removeSparseTerms(clean_m, sparse = sparse)
  return(clean_non_sparse)
}
percentmiss = function(x){ sum(is.na(x))/length(x) *100 }
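Before applying these helpers to the real data, here is a quick, hypothetical illustration of what clean_corpus does to raw text (the two example strings are made up):
# Hypothetical two-document corpus: clean_corpus removes punctuation, lower-cases,
# drops English stopwords, and collapses extra whitespace.
demo_corpus <- VCorpus(VectorSource(c("Oh baby, HOW are you doing?",
                                      "I don't   know why I love it!")))
demo_clean <- clean_corpus(demo_corpus)
content(demo_clean[[1]])
content(demo_clean[[2]])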
Importing the data
setwd('C:/Users/bneer/OneDrive/Desktop/Analyzing Human Language/Final project')
lyrics <- read.csv('lyrics.csv')
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
## dec, : EOF within quoted string
lyrics <- lyrics[,-1]
head(lyrics[,-5])
## song year artist genre
## 1 ego-remix 2009 beyonce-knowles Pop
## 2 then-tell-me 2009 beyonce-knowles Pop
## 3 honesty 2009 beyonce-knowles Pop
## 4 you-are-my-rock 2009 beyonce-knowles Pop
## 5 black-culture 2009 beyonce-knowles Pop
## 6 all-i-could-do-was-cry 2009 beyonce-knowles Pop
Cleaning data
str(lyrics)
## 'data.frame': 339277 obs. of 5 variables:
## $ song : Factor w/ 236867 levels "0-0","0-0-0",..: 56272 205225 85347 233083 23435 8810 146916 219754 178029 228598 ...
## $ year : int 2009 2009 2009 2009 2009 2009 2009 2009 2009 2009 ...
## $ artist: Factor w/ 17088 levels "009-sound-system",..: 4499 4499 4499 4499 4499 4499 4499 4499 4499 4499 ...
## $ genre : Factor w/ 12 levels "Country","Electronic",..: 10 10 10 10 10 10 10 10 10 10 ...
## $ lyrics: Factor w/ 229599 levels "","\003Its what youre afraid of.\nAll of my fears,\nAll of my Faults.\nAll that came first,\nAll will be lost.",..: 141437 150816 108359 142333 149557 92450 187210 196954 16818 136101 ...
We only have 229599 unique lyric values in a data set with 339277 rows, so clearly we have duplicate rows.
Checking for duplicate rows (same lyrics at multiple places)
length(table(lyrics$lyrics)[table(lyrics$lyrics) > 1])
## [1] 12921
Removing duplicates
lyrics <- lyrics[!duplicated(lyrics$lyrics), ]
Checking for missing lyrics
sum(lyrics$lyrics == '')
## [1] 1
There is one song with no lyrics. Let's try to extract the lyrics.
rownames(lyrics) <- 1:nrow(lyrics)
lyrics[lyrics$lyrics == '',]
## song year artist genre lyrics
## 143 lemonade 2016 beyonce-knowles Pop
lyrics$lyrics[143] <- tryCatch(genius_lyrics(artist = 'beyonce knowles',
song = 'lemonade', info = 'simple'),
error = function(x){return('')})
## Warning in request_GET(session, url): Not Found (HTTP 404).
The lyrics are not available in the database so we will remove this row.
lyrics <- lyrics[-143,]
sum(lyrics$lyrics == '')
## [1] 0
No more missing lyrics
length(table(lyrics$lyrics)[table(lyrics$lyrics) > 1])
## [1] 0
No more duplicates
Final sanity check
str(lyrics)
## 'data.frame': 229598 obs. of 5 variables:
## $ song : Factor w/ 236867 levels "0-0","0-0-0",..: 56272 205225 85347 233083 23435 8810 146916 219754 178029 228598 ...
## $ year : int 2009 2009 2009 2009 2009 2009 2009 2009 2009 2009 ...
## $ artist: Factor w/ 17088 levels "009-sound-system",..: 4499 4499 4499 4499 4499 4499 4499 4499 4499 4499 ...
## $ genre : Factor w/ 12 levels "Country","Electronic",..: 10 10 10 10 10 10 10 10 10 10 ...
## $ lyrics: Factor w/ 229599 levels "","\003Its what youre afraid of.\nAll of my fears,\nAll of my Faults.\nAll that came first,\nAll will be lost.",..: 141437 150816 108359 142333 149557 92450 187210 196954 16818 136101 ...
apply(lyrics, 2, percentmiss)
## song year artist genre lyrics
## 0 0 0 0 0
No missing data
levels(lyrics$genre)
## [1] "Country" "Electronic" "Folk" "Hip-Hop"
## [5] "Indie" "Jazz" "Metal" "Not Available"
## [9] "Other" "Pop" "R&B" "Rock"
We have some songs for which we have no genre information. The best thing to do is to classify them as Other.
lyrics$genre[lyrics$genre == 'Not Available'] <- 'Other'
lyrics$genre <- factor(lyrics$genre)
levels(lyrics$genre)
## [1] "Country" "Electronic" "Folk" "Hip-Hop" "Indie"
## [6] "Jazz" "Metal" "Other" "Pop" "R&B"
## [11] "Rock"
ggplot(data = lyrics, aes(x = genre)) + geom_bar()
Most of the songs are Rock songs.
Processing data
Convert the data into a tibble
lyrics_tibble <- lyrics
lyrics_tibble$lyrics <- as.character(lyrics_tibble$lyrics)
lyrics_tibble$song <- as.character(lyrics_tibble$song)
lyrics_tibble$artist <- as.character(lyrics_tibble$artist)
lyrics_tibble$genre <- as.character(lyrics_tibble$genre)
lyrics_tibble <- as_tibble(lyrics_tibble)
lyrics_tibble
## # A tibble: 229,598 x 5
## song year artist genre lyrics
## <chr> <int> <chr> <chr> <chr>
## 1 ego-remix 2009 beyonce-k~ Pop "Oh baby, how you doing?\nYou know~
## 2 then-tell-me 2009 beyonce-k~ Pop "playin' everything so easy,\nit's~
## 3 honesty 2009 beyonce-k~ Pop "If you search\nFor tenderness\nIt~
## 4 you-are-my-r~ 2009 beyonce-k~ Pop "Oh oh oh I, oh oh oh I\n[Verse 1:~
## 5 black-culture 2009 beyonce-k~ Pop "Party the people, the people the ~
## 6 all-i-could-~ 2009 beyonce-k~ Pop "I heard\nChurch bells ringing\nI ~
## 7 once-in-a-li~ 2009 beyonce-k~ Pop "This is just another day that I w~
## 8 waiting 2009 beyonce-k~ Pop "Waiting, waiting, waiting, waitin~
## 9 slow-love 2009 beyonce-k~ Pop "[Verse 1:]\nI read all of the mag~
## 10 why-don-t-yo~ 2009 beyonce-k~ Pop "N-n-now, honey\nYou better sit do~
## # ... with 229,588 more rows
Creating a corpus for rock, pop and hip-hop genres
unique(lyrics_tibble$genre)
## [1] "Pop" "Hip-Hop" "Other" "Rock" "Metal"
## [6] "Country" "Jazz" "Electronic" "Folk" "R&B"
## [11] "Indie"
rock_m <- create_genre(lyrics_tibble, 'Rock')
rock_freq <- create_genre(lyrics_tibble, 'Rock', frequency = T)
pop_m <- create_genre(lyrics_tibble, 'Pop')
pop_freq <- create_genre(lyrics_tibble, 'Pop', frequency = T)
hiphop_m <- create_genre(lyrics_tibble, 'Hip-Hop')
hiphop_freq <- create_genre(lyrics_tibble, 'Hip-Hop', frequency = T)
Here’s how the data looks
rock_m[1:10, 1:5]
## Docs
## Terms 1 2 3 4 5
## aint 3 1 1 1 4
## alone 0 0 0 0 0
## always 1 0 1 0 1
## another 1 1 0 1 0
## around 0 0 0 0 0
## away 0 0 0 0 0
## baby 0 0 0 0 0
## back 0 0 0 1 0
## bad 1 0 0 0 0
## behind 0 0 1 0 0
Wordclouds
rock_vec <- names(rock_freq)
pop_vec <- names(pop_freq)
hiphop_vec <- names(hiphop_freq)
wordcloud(rock_vec, rock_freq, max.words = 50, colors = 'red')
wordcloud(pop_vec, pop_freq, max.words = 50, colors = 'red')
wordcloud(hiphop_vec, hiphop_freq, max.words = 50, colors = 'red')
One thing to note here is that nigga and niggas are represented separately. This is what we are looking to avoid, and it is why we need stemming.
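As a small, hypothetical illustration of what stemming does (SnowballC provides the Porter stemmer behind tm's stemDocument), inflected forms of a word collapse to a common stem and are counted together:
# Hypothetical example: the Porter stemmer maps inflected variants to one stem,
# so singular/plural forms are no longer treated as separate terms.
library(SnowballC)
wordStem(c("love", "loves", "loved", "loving"))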
Activating the stemming option
rock_m <- create_genre(lyrics_tibble, 'Rock', stem = T)
rock_freq <- create_genre(lyrics_tibble, 'Rock', stem = T, frequency = T)
pop_m <- create_genre(lyrics_tibble, 'Pop', stem = T)
pop_freq <- create_genre(lyrics_tibble, 'Pop', stem = T, frequency = T)
hiphop_m <- create_genre(lyrics_tibble, 'Hip-Hop', stem = T)
hiphop_freq <- create_genre(lyrics_tibble, 'Hip-Hop', stem = T, frequency = T)
Recreating the wordclouds
rock_vec <- names(rock_freq)
pop_vec <- names(pop_freq)
hiphop_vec <- names(hiphop_freq)
Rock
wordcloud(rock_vec, rock_freq, max.words = 50, colors = 'red')
Rock lyrics appear to be mostly about someone, indicated by words such as know, like, dont, your, now, world, heart, love, live, and feel. Rock songs look largely like songs about someone you love.
Pop
wordcloud(pop_vec, pop_freq, max.words = 50, colors = 'red')
It is interesting to see that the most frequent words in rock and pop music are very similar to each other.
Hip-Hop
wordcloud(hiphop_vec, hiphop_freq, max.words = 50, colors = 'red')
It looks like baby, shit, man, nigga, girl, money, the b word, and the f word are the most frequently used words in hip-hop music, which is consistent with these being commonly used rap words.
Commonality clouds
Plotting a wordcloud of common lyric words between rock and pop
rock <- lyrics[lyrics$genre == 'Rock', 'lyrics']
rock <- as.character(rock)
pop <- lyrics[lyrics$genre == 'Pop', 'lyrics']
pop <- as.character(pop)
all_rock <- paste(rock, collapse = " ")
all_pop <- paste(pop, collapse = " ")
pop_rock <- c(all_rock, all_pop)
pop_rock_source <- VectorSource(pop_rock)
pop_rock_corpus <- VCorpus(pop_rock_source)
pop_rock_clean <- clean_corpus(pop_rock_corpus)
pop_rock_tdm <- TermDocumentMatrix(pop_rock_clean)
pop_rock_tdm_non_sparse <- removeSparseTerms(pop_rock_tdm, sparse = 0.95)
pop_rock_m <- as.matrix(pop_rock_tdm_non_sparse)
colnames(pop_rock_m) <- c('Rock', 'Pop')
# word cloud of common words
commonality.cloud(pop_rock_m, max.words = 50, colors = 'steelblue1')
rm(rock)
rm(pop)
rm(all_rock)
rm(all_pop)
rm(pop_rock)
rm(pop_rock_source)
rm(pop_rock_corpus)
rm(pop_rock_clean)
rm(pop_rock_tdm)
rm(pop_rock_tdm_non_sparse)
rm(pop_rock_m)
We can see that know, like, don't, you're, now, world, heart, and baby are common words between rock and pop, which supports the idea that both genres are largely about someone you love.
Word associations
One thing to explore is which words are closest to love in rock, pop, and hip-hop.
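For reference, the score reported by findAssocs() below is the correlation between two terms' occurrence counts across lyrics, and only terms with a correlation above the chosen threshold (0.1 here) are returned:

$$ r_{xy} = \frac{\sum_d (x_d - \bar{x})(y_d - \bar{y})}{\sqrt{\sum_d (x_d - \bar{x})^2}\,\sqrt{\sum_d (y_d - \bar{y})^2}} $$

where x_d and y_d are the counts of the two terms in lyric d.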
rock_dtm <- create_tdm_dtm(lyrics_tibble, 'Rock', stem = T)
pop_dtm <- create_tdm_dtm(lyrics_tibble, 'Pop', stem = T)
hiphop_dtm <- create_tdm_dtm(lyrics_tibble, 'Hip-Hop', stem = T)
Rock
associations <- findAssocs(rock_dtm, 'love', 0.1)
associations_df <- list_vect2df(associations, col2 = 'word', col3 = 'score')
ggplot(data = associations_df, aes(score, word)) +
geom_point(size = 3) +
theme_gdocs()
Pop
associations <- findAssocs(pop_dtm, 'love', 0.1)
associations_df <- list_vect2df(associations, col2 = 'word', col3 = 'score')
ggplot(data = associations_df, aes(score, word)) +
geom_point(size = 3) +
theme_gdocs()
Hip-Hop
associations <- findAssocs(hiphop_dtm, 'love', 0.1)
associations_df <- list_vect2df(associations, col2 = 'word', col3 = 'score')
ggplot(data = associations_df, aes(score, word)) +
geom_point(size = 3) +
theme_gdocs()
In rock and pop, the words most closely associated with love are heart, need, baby, feel, and true. In hip-hop these words also appear close to love, but so do words like girl and hate, which indicates that love is not only used in the romantic sense seen in rock and pop but also in a typical rap format, tied to girl and even hate.
Let's find associations for the term nigga.
associations_nigga <- findAssocs(hiphop_dtm, 'nigga', 0.2)
associations_nigga_df <- list_vect2df(associations_nigga, col2 = 'word',
col3 = 'score')
ggplot(data = associations_nigga_df, aes(score, word)) +
geom_point(size = 3) +
theme_gdocs()
The s word, b word, and f word are the closest to nigga, which makes sense since in a rap context lyrics very often pair these words together.
Exploring the association of words close to money in hip-hop music
associations_money <- findAssocs(hiphop_dtm, 'money', 0.1)
associations_money_df <- list_vect2df(associations_money, col2 = 'word',
col3 = 'score')
ggplot(data = associations_money_df, aes(score, word)) +
geom_point(size = 3) +
theme_gdocs()
As expected, cash is the closest, but b—-, nigga, buy, bank, hundr, dirty, and dough are all words that commonly appear close to one another in rap.
Clustering
1) Rock
rock_dtm2 <- removeSparseTerms(rock_dtm, sparse = 0.75)
rock_df <- as.data.frame(as.matrix(rock_dtm2))
rock_dist <- dist(rock_df)
hc_rock <- hclust(rock_dist, method = 'ward.D2')
hc_rock
##
## Call:
## hclust(d = rock_dist, method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 21
plot(hc_rock)
Using silhouette width to identify the optimal number of clusters
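For readers unfamiliar with the measure, the silhouette width of an observation compares its average distance to the other members of its own cluster, a(i), with its average distance to the members of the nearest other cluster, b(i):

$$ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} $$

Values near 1 mean the observation sits comfortably inside its cluster; we pick the number of clusters that gives the largest average silhouette width.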
sapply(2:11, function(x) summary(silhouette(cutree(hc_rock, k = x), rock_dist))$avg.width)
## [1] 0.25041182 0.15706970 0.11688794 0.08058145 0.05602927 0.05468098
## [7] 0.04706088 0.04597518 0.04255026 0.04264160
It looks like the optimal number of clusters is 2.
{plot(hc_rock, hang = -1)
rect.hclust(hc_rock, k = 2)}
It looks like love gets its own cluster. Let's remove it and see how the clustering changes.
rock_df_nolove <- rock_df[!rownames(rock_df) %in% ('love'),]
rock_dist_nolove <- dist(rock_df_nolove)
hc_rock_nolove <- hclust(rock_dist_nolove, method = 'ward.D2')
hc_rock_nolove
##
## Call:
## hclust(d = rock_dist_nolove, method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 20
plot(hc_rock_nolove)
sapply(2:11, function(x) summary(silhouette(cutree(hc_rock_nolove, k = x), rock_dist_nolove))$avg.width)
## [1] 0.16492318 0.12273234 0.08461053 0.05883074 0.05741503 0.04941392
## [7] 0.04827394 0.04467778 0.04477368 0.03732417
{plot(hc_rock_nolove, hang = -1)
rect.hclust(hc_rock_nolove, k = 2)}
Now dont and know form their own cluster. We saw from the wordcloud that love, dont, and know are among the most common lyric words in rock music, and many songs contain phrases like I dont know how, I dont want to know, or If you didn't know. Thanks to the removal of stopwords and the use of stemming we are able to see that.
2) Hip-Hop
hiphop_dtm2 <- removeSparseTerms(hiphop_dtm, sparse = 0.60)
hiphop_df <- as.data.frame(as.matrix(hiphop_dtm2))
hiphop_dist <- dist(hiphop_df)
hc_hiphop <- hclust(hiphop_dist, method = 'ward.D2')
hc_hiphop
##
## Call:
## hclust(d = hiphop_dist, method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 28
plot(hc_hiphop)
sapply(2:11, function(x) summary(silhouette(cutree(hc_hiphop, k = x), hiphop_dist))$avg.width)
## [1] 0.31038093 0.29377729 0.29020628 0.17038019 0.17187303 0.07800756
## [7] 0.07945610 0.08239001 0.08342999 0.06204088
{plot(hc_hiphop, hang = -1)
rect.hclust(hc_hiphop, k = 2)}
Nigga, get, and like are in their own cluster, while the other rap words, including the f word and b word, are in another cluster. Again, that is consistent with what our wordclouds told us earlier. What is interesting is the use of love in hip-hop: for rock, love had its own cluster and was closer to your, get, and like, whereas here love sits in the same cluster as aint, the f word, and shit, which indicates that it is used in different contexts in the two forms of music.
Sentiment analysis for the hip-hop genre
Is there a relationship between different sentiments and time?
Extracting the sentiments
lyrics_tibble_hiphop <- filter(lyrics_tibble, genre == 'Hip-Hop')
tidy_lyrics <- lyrics_tibble_hiphop %>%
unnest_tokens(word, lyrics)
# word totals for each song
totals_hiphop <- tidy_lyrics %>%
count(song) %>%
rename(total_words = n)
lyric_counts <- tidy_lyrics %>%
left_join(totals_hiphop, by = 'song')
lyric_sentiment <- lyric_counts %>%
inner_join(get_sentiments('nrc'))
## Joining, by = "word"
How many words of each sentiment does each song have?
lyric_sentiment %>%
count(song, sentiment, sort = TRUE)
## # A tibble: 167,826 x 3
## song sentiment n
## <chr> <chr> <int>
## 1 bang-bang negative 500
## 2 intro negative 483
## 3 bang-bang anger 457
## 4 bang-bang disgust 420
## 5 bang-bang fear 420
## 6 money positive 420
## 7 intro positive 409
## 8 rap-monument negative 392
## 9 bang-bang sadness 381
## 10 bang-bang surprise 373
## # ... with 167,816 more rows
The most negative songs
lyric_sentiment %>%
# Count using three arguments
count(song, sentiment, total_words) %>%
ungroup() %>%
# Make a new percent column with mutate
mutate(percent = n/total_words) %>%
# Filter for only negative words
filter(sentiment == 'negative') %>%
# Arrange by descending percent
arrange(desc(percent))
## # A tibble: 17,388 x 5
## song sentiment total_words n percent
## <chr> <chr> <int> <int> <dbl>
## 1 boy-oh-boy-thugli-remix negative 47 19 0.404
## 2 riot-fight negative 81 30 0.370
## 3 where negative 17 6 0.353
## 4 mud-digger negative 9 3 0.333
## 5 charlie-manson negative 423 138 0.326
## 6 shaky-shaky-remix negative 605 171 0.283
## 7 red-opps negative 447 121 0.271
## 8 where-we-from negative 45 12 0.267
## 9 rock-shyt negative 4 1 0.25
## 10 mo-thug-interlude negative 229 55 0.240
## # ... with 17,378 more rows
The most positive songs
lyric_sentiment %>%
count(song, sentiment, total_words) %>%
ungroup() %>%
mutate(percent = n/total_words) %>%
filter(sentiment == 'positive') %>%
arrange(desc(percent))
## # A tibble: 17,419 x 5
## song sentiment total_words n percent
## <chr> <chr> <int> <int> <dbl>
## 1 holy-god positive 153 94 0.614
## 2 agnus-dei positive 76 32 0.421
## 3 we-found-love-remix positive 10 4 0.4
## 4 triune-god positive 144 49 0.340
## 5 shattered positive 248 83 0.335
## 6 hot-metal positive 12 4 0.333
## 7 packet-prelude positive 6 2 0.333
## 8 so-real positive 3 1 0.333
## 9 praise-his-holy-name positive 72 23 0.319
## 10 o-christmas-tree-o-tannenbaum positive 97 30 0.309
## # ... with 17,409 more rows
All the sentiments
unique(lyric_sentiment$sentiment)
## [1] "positive" "anger" "negative" "disgust"
## [5] "fear" "sadness" "anticipation" "joy"
## [9] "surprise" "trust"
Let's look at song evolution over time
First, look at all the unique years
unique(lyric_sentiment$year)
## [1] 2007 1998 2006 2002 1995 2009 2010 2012 2015 2014 2013 2011 2008 2016
## [15] 2004 2005 2003 1992 702 1989 1996 1999 1994 2001 2000 112 1991 1990
## [29] 1982 1993 1997
We can see that there is clearly a mistake: two of the years are 112 and 702, which is not possible unless we are talking about the first millennium. Let's get the songs for these years.
faulty_years <- lyric_sentiment %>%
filter(year == 702 | year == 112)
song_faulty <- faulty_years %>% group_by(song, artist) %>%
summarise(count = n())
song_faulty
## # A tibble: 4 x 3
## # Groups: song [4]
## song artist count
## <chr> <chr> <int>
## 1 anywhere-remix dru-hill 96
## 2 come-see-me-remix black-rob 103
## 3 it-s-over-now-remix g-dep 103
## 4 star clipse 50
Now that we know the song names and artists it is easy to find the years. We can look them up on Google or use the gen_song_url function; we will do both.
Song URL through the genius package (see reference 4 for the Google links.)
for(i in 1:nrow(song_faulty)){
print(gen_song_url(artist = song_faulty$artist[i], song = song_faulty$song[i]))
}
## [1] "https://genius.com/dru-hill-anywhere-remix-lyrics"
## [1] "https://genius.com/black-rob-come-see-me-remix-lyrics"
## [1] "https://genius.com/g-dep-it-s-over-now-remix-lyrics"
## [1] "https://genius.com/clipse-star-lyrics"
anywhere-remix by dru-hill was composed in 1999 (ref 4).
come-see-me-remix by black-rob was composed in 1996 (ref 4).
it-s-over-now-remix was composed in 2001 (ref 4).
star was composed in 2002 (ref 4).
Correcting the years
lyric_sentiment[lyric_sentiment$song == 'anywhere-remix', ]$year <- 1999
lyric_sentiment[lyric_sentiment$song == 'come-see-me-remix', ]$year <- 1996
lyric_sentiment[lyric_sentiment$song == 'it-s-over-now-remix', ]$year <- 2001
lyric_sentiment[lyric_sentiment$song == 'star', ]$year <- 2002
Checking again
unique(lyric_sentiment$year)
## [1] 2007 1998 2006 2002 1995 2009 2010 2012 2015 2014 2013 2011 2008 2016
## [15] 2004 2005 2003 1992 1989 1996 1999 1994 2001 2000 1991 1990 1982 1993
## [29] 1997
Looks good now.
The trend in emotional content in lyrics in Hip-hop from the early 80s – 2016
What emotions are most represented in hip-hop?
# Creating a copy of the file so that if we go wrong we have the original
new_hip_hop <- lyric_sentiment
new_hip_hop %>%
ggplot(aes(x = factor(sentiment))) +
geom_bar()
Mostly negative and positive sentiments.
Positive sentiments
new_hip_hop %>%
filter(sentiment == 'positive') %>%
count(song, year, total_words) %>%
ungroup() %>%
mutate(percent = n / total_words,
year = 10 * floor(year / 10)) %>%
ggplot(aes(x = factor(year), y = percent)) +
geom_boxplot()
It looks like the trend in positive words has not changed much from the 80s until now; hip-hop lyrics appear to have used positive words at a consistent rate over that period.
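(As an aside, the year = 10 * floor(year / 10) step in these chunks bins individual years into decades before plotting; a quick check of the arithmetic:)
# Decade binning used in the boxplot chunks, e.g. 1987 -> 1980
10 * floor(c(1987, 1994, 2003) / 10)
# gives 1980 1990 2000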
Negative words
new_hip_hop %>%
filter(sentiment == 'negative') %>%
count(song, year, total_words) %>%
ungroup() %>%
mutate(percent = n / total_words,
year = 10 * floor(year / 10)) %>%
ggplot(aes(x = factor(year), y = percent)) +
geom_boxplot()
This is interesting. There was a higher usage of negative words in the 1980s than in the 1990s, and usage has been fairly consistent since then, so the use of negative words appears to have decreased over time, at least when comparing the 80s with the 90s.
Fear
new_hip_hop %>%
filter(sentiment == 'fear') %>%
count(song, year, total_words) %>%
ungroup() %>%
mutate(percent = n / total_words,
year = 10 * floor(year / 10)) %>%
ggplot(aes(x = factor(year), y = percent)) +
geom_boxplot()
The distribution of fearful-word usage is right skewed, with a concentration of songs above the median. As with negative words, fearful words appear to have decreased between the 80s and the 90s and been stable since then.
Joyful words
new_hip_hop %>%
filter(sentiment == 'joy') %>%
count(song, year, total_words) %>%
ungroup() %>%
mutate(percent = n / total_words,
year = 10 * floor(year / 10)) %>%
ggplot(aes(x = factor(year), y = percent)) +
geom_boxplot()
Again, there appears to be some difference in the usage of joyful words between the 80s and the 90s.
Let's model these relationships.
Positive
pos_by_year <- lyric_sentiment %>%
# Filter for positive words
filter(sentiment == 'positive') %>%
count(song, year, total_words) %>%
ungroup() %>%
# Define a new column: percent
mutate(percent = n/total_words)
model_pos_emo <- lm(percent ~ year, data = pos_by_year)
summary(model_pos_emo)
##
## Call:
## lm(formula = percent ~ year, data = pos_by_year)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.03998 -0.01683 -0.00383 0.01065 0.57485
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.932e-01 9.758e-02 1.980 0.0477 *
## year -7.653e-05 4.858e-05 -1.575 0.1152
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02778 on 19146 degrees of freedom
## Multiple R-squared: 0.0001296, Adjusted R-squared: 7.739e-05
## F-statistic: 2.482 on 1 and 19146 DF, p-value: 0.1152
As expected, year does not have a significant relationship with the proportion of positive words in hip-hop lyrics.
Negative
neg_by_year <- lyric_sentiment %>%
# Filter for negative words
filter(sentiment == 'negative') %>%
count(song, year, total_words) %>%
ungroup() %>%
mutate(percent = n/total_words)
model_neg_emo <- lm(percent ~ year, data = neg_by_year)
summary(model_neg_emo)
##
## Call:
## lm(formula = percent ~ year, data = neg_by_year)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.04481 -0.01822 -0.00302 0.01320 0.36474
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.983e-01 9.290e-02 7.517 5.85e-14 ***
## year -3.271e-04 4.625e-05 -7.073 1.56e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02636 on 19090 degrees of freedom
## Multiple R-squared: 0.002614, Adjusted R-squared: 0.002562
## F-statistic: 50.03 on 1 and 19090 DF, p-value: 1.564e-12
Negative words, however, have a significant negative relationship with year, indicating that the use of negative words in hip-hop lyrics has decreased from the 80s until now.
Fear
fear_by_year <- lyric_sentiment %>%
# Filter for fearful words
filter(sentiment == 'fear') %>%
count(song, year, total_words) %>%
ungroup() %>%
mutate(percent = n/total_words)
model_fear_emo <- lm(percent ~ year, data = fear_by_year)
summary(model_fear_emo)
##
## Call:
## lm(formula = percent ~ year, data = fear_by_year)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.02335 -0.01196 -0.00372 0.00696 0.47847
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2488931 0.0666856 3.732 0.000190 ***
## year -0.0001128 0.0000332 -3.397 0.000682 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0186 on 18537 degrees of freedom
## Multiple R-squared: 0.0006222, Adjusted R-squared: 0.0005683
## F-statistic: 11.54 on 1 and 18537 DF, p-value: 0.0006822
Again, there is a significant relationship between the proportion of fearful words and year, indicating that the use of fearful words has decreased over time.
Joy
joy_by_year <- lyric_sentiment %>%
# Filter for joyful words
filter(sentiment == 'joy') %>%
count(song, year, total_words) %>%
ungroup() %>%
mutate(percent = n/total_words)
model_joy_emo <- lm(percent ~ year, data = joy_by_year)
summary(model_joy_emo)
##
## Call:
## lm(formula = percent ~ year, data = joy_by_year)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.02307 -0.01193 -0.00511 0.00477 0.37794
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.4942009 0.0757234 -6.526 6.91e-11 ***
## year 0.0002567 0.0000377 6.810 1.01e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.021 on 18367 degrees of freedom
## Multiple R-squared: 0.002519, Adjusted R-squared: 0.002464
## F-statistic: 46.38 on 1 and 18367 DF, p-value: 1.005e-11
In contrast to the negative and fearful sentiments, the coefficient for year here is positive and significant, indicating that the use of joyful words has increased over time.
So far we have looked at the sentiments individually, but what if we combine them and test whether the usage of positive or joyful words in hip-hop lyrics has changed, compared with negative or fearful words?
Positive and joyful
posjoy_by_year <- lyric_sentiment %>%
# Filter for positive and joyful words
filter(sentiment == 'positive' | sentiment == 'joy') %>%
count(song, year, total_words) %>%
ungroup() %>%
mutate(percent = n/total_words)
model_posjoy_emo <- lm(percent ~ year, data = posjoy_by_year)
summary(model_posjoy_emo)
##
## Call:
## lm(formula = percent ~ year, data = posjoy_by_year)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.06034 -0.02725 -0.00813 0.01502 0.73962
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.2144800 0.1633071 -1.313 0.1891
## year 0.0001367 0.0000813 1.681 0.0927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0465 on 19149 degrees of freedom
## Multiple R-squared: 0.0001476, Adjusted R-squared: 9.537e-05
## F-statistic: 2.827 on 1 and 19149 DF, p-value: 0.09274
There isn't a significant relationship between the usage of positive and joyful words and year.
Negative and fearful
negfear_by_year <- lyric_sentiment %>%
# Filter for negative and fearful words
filter(sentiment == 'negative' | sentiment == 'fear') %>%
count(song, year, total_words) %>%
ungroup() %>%
mutate(percent = n/total_words)
model_negfear_emo <- lm(percent ~ year, data = negfear_by_year)
summary(model_negfear_emo)
##
## Call:
## lm(formula = percent ~ year, data = negfear_by_year)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.06781 -0.02815 -0.00549 0.01972 0.65863
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.030e+00 1.458e-01 7.063 1.68e-12 ***
## year -4.814e-04 7.256e-05 -6.634 3.37e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04156 on 19175 degrees of freedom
## Multiple R-squared: 0.00229, Adjusted R-squared: 0.002238
## F-statistic: 44 on 1 and 19175 DF, p-value: 3.365e-11
Again negative and fearful words have decreased over time.
tf-idf transformation
What is tf-idf?
knitr::include_graphics("Capture.PNG")
This kind of transformation allows us to up-weight words that occur frequently in only some documents and down-weight words that occur frequently in every document.
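In case the screenshot does not render, the weighting applied below is, up to the exact logarithm base and constants used by the lsa package's lw_logtf and gw_idf functions, the familiar log-scaled term frequency multiplied by inverse document frequency:

$$ w_{t,d} = \log\!\left(1 + \mathrm{tf}_{t,d}\right) \times \log\!\left(\frac{N}{\mathrm{df}_t}\right) $$

where tf_{t,d} is the count of term t in lyric d, N is the number of lyrics, and df_t is the number of lyrics containing term t. Terms that show up in nearly every lyric get an idf near zero and are down-weighted, while terms concentrated in a few lyrics are up-weighted.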
hiphop_weight <- lw_logtf(hiphop_m) * gw_idf(hiphop_m)
hiphop_weight[1:10,1:5]
## Docs
## Terms 1 2 3 4 5
## act 0.000000 0.000000 5.614559 0.000000 0
## aint 2.874354 0.000000 0.000000 3.199968 0
## air 0.000000 0.000000 0.000000 0.000000 0
## alon 0.000000 0.000000 0.000000 0.000000 0
## alreadi 10.379107 3.459702 0.000000 0.000000 0
## alway 0.000000 0.000000 0.000000 6.309563 0
## anoth 0.000000 0.000000 0.000000 2.556010 0
## anyth 0.000000 0.000000 0.000000 0.000000 0
## arm 0.000000 0.000000 0.000000 0.000000 0
## around 0.000000 2.038425 0.000000 0.000000 0
LSA or singular value decomposition
Now we want to estimate how strongly a word is associated with each lyric.
hiphop_lsa <- lsa(hiphop_weight)
hiphop_lsa$tk[1:20,1:5]
## [,1] [,2] [,3] [,4] [,5]
## act -0.03480244 0.008100931 -0.0117173074 0.021668166 -1.645902e-02
## aint -0.11087805 0.049291005 -0.0574075158 -0.081649567 4.034618e-03
## air -0.02466065 -0.001073388 -0.0020002633 0.036047161 -1.537891e-02
## alon -0.02425898 -0.058879005 0.0025335214 -0.036290013 3.527319e-02
## alreadi -0.02099539 -0.009305805 -0.0221289490 -0.012966318 -1.474174e-02
## alway -0.04400585 -0.054321844 0.0094484544 -0.039458815 -1.153625e-02
## anoth -0.04141720 -0.025157073 0.0279065097 -0.009862423 -1.041954e-02
## anyth -0.01898343 -0.027999761 -0.0137310848 -0.020928304 -7.026555e-05
## arm -0.01818028 -0.003579844 0.0227906092 0.014086706 -2.284745e-03
## around -0.05772882 -0.002707595 -0.0100973361 0.007742271 6.688059e-03
## ask -0.04220455 -0.008738280 -0.0006345552 0.003754670 -3.964757e-02
## ass -0.06959866 0.122095289 -0.0345856885 0.041283926 5.135187e-02
## away -0.03789859 -0.075999214 0.0354763858 -0.055598721 4.191956e-02
## babi -0.07387702 -0.103498801 -0.2395487446 -0.010093611 1.820231e-01
## back -0.09117613 0.008310097 0.0086951455 0.045443585 -1.356482e-02
## bad -0.04442159 0.005976440 -0.0311681246 -0.007291067 -3.513891e-02
## bag -0.02596100 0.029919505 -0.0205709832 0.011840158 -6.211249e-02
## ball -0.03168073 0.038758049 -0.0205514473 0.001686696 -5.190880e-02
## bang -0.02634671 0.048980106 0.0062113353 0.011313392 2.193570e-02
## bank -0.01920501 0.029773050 -0.0178481910 -0.006116624 -4.013172e-02
Here we can say that bank, for example, is more likely to occur in lyric 2 and less likely in lyrics 1, 3, 4, and 5, while air is more likely to occur in lyric 4 than in lyrics 1, 2, 3, and 5. This matrix lets us see which words are likely to be part of the same lyrics and therefore likely to be close to one another. Using tf-idf and LSA and then finding neighboring words gives better results than using raw frequencies and plotting associations, because the tf-idf weighting raises the weight of words concentrated in some lyrics and lowers the weight of words that occur in every lyric. Applying LSA to the tf-idf matrix then estimates how strongly each word is tied to each lyric, so words with similar profiles across lyrics end up close to one another. For example, b–g, ball, bank, and bad are all likely to be part of lyric 2.
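For reference, lsa() computes a truncated singular value decomposition of the weighted term-document matrix,

$$ M \approx T_k \, S_k \, D_k^{\top} $$

where M is the tf-idf weighted matrix, the rows of T_k are the term vectors (the $tk matrix printed above), S_k holds the k largest singular values, and the rows of D_k are the document vectors; the number of dimensions k is chosen here by lsa's default dimension-selection heuristic. Terms whose vectors point in similar directions tend to occur in similar lyrics, which is what the neighbor plots and cosine similarities below exploit.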
Visualizing the nearest neighboring words by converting the lsa matrix to a text matrix
hiphop_lsa <- as.textmatrix(hiphop_lsa)
Does hip-hop actually need explicit words to sound the way it does, and what about words like god, love, and fight? It would be helpful to know in what contexts these words are used in rock and hip-hop.
We saw the associations of words with love earlier; now let's look at the nearest neighbors of the word love.
plot_neighbors('love', n = 10, tvectors = hiphop_lsa, method = 'MDS',
dims = 2)
## x y
## love 0.392197538 0.007502624
## know 0.001252845 0.034051261
## just -0.003667512 0.021808355
## dont -0.032886954 0.011014319
## make -0.019187902 -0.167634249
## like -0.054413510 -0.018198256
## now -0.057300014 0.027333658
## got -0.069406422 0.003463801
## caus -0.066310244 0.097571780
## get -0.090277827 -0.016913293
We can see that, even though the neighbors are not as close as you would normally want, words like make, cause, just, and get are the nearest neighbors of love. This makes sense, since the use of love in hip-hop songs differs from the way it is used in rock and pop, where it appears in a more romantic sense.
Let's confirm that by looking at the use of love in rock music.
rock_weight <- lw_logtf(rock_m) * gw_idf(rock_m)
rock_lsa <- lsa(rock_weight)
rock_lsa <- as.textmatrix(rock_lsa)
plot_neighbors('love', n = 10, tvectors = rock_lsa, method = 'MDS',
dims = 2)
## x y
## love -0.22638180 -0.03903309
## true -0.10177218 0.07500461
## kiss -0.17761713 -0.26971970
## sure 0.22161890 -0.03050484
## know 0.12411929 -0.03216807
## just 0.13655588 -0.09504119
## made 0.13788701 0.14327465
## smile 0.08054694 -0.23116227
## heart -0.35746376 0.20080408
## found 0.16250686 0.27854581
Here we can see that love is surrounded by kiss, true, heart, smile, and found, which indicates that love is used in a more romantic sense in rock music, whereas in hip-hop it functions more as slang than as the conveyance of a feeling.
Let's look at the use of the word god.
Hip-Hop
plot_neighbors('god', n = 10, tvectors = hiphop_lsa, method = 'MDS',
dims = 2)
## x y
## god -0.20636151 -0.08765558
## bless -0.27632179 -0.05005261
## thank -0.29835300 -0.17231468
## pray -0.25695484 -0.01435238
## save -0.05194345 0.32872585
## power 0.05176251 0.34657782
## swear 0.25265300 -0.40517210
## see 0.24004275 0.04081413
## like 0.28985716 0.03666967
## just 0.25561916 -0.02324012
As expected, pray, bless, thank, save, power, and swear are the words closest to god.
Rock
plot_neighbors('god', n = 10, tvectors = rock_lsa, method = 'MDS',
dims = 2)
## x y
## god 0.03008374 -0.20112406
## soul 0.03538635 -0.02188907
## hell 0.28507869 -0.07502462
## name -0.11188762 -0.37839246
## hope -0.07976144 0.24723374
## must -0.11745096 0.01426231
## made -0.14684025 -0.03124707
## help -0.30477211 0.15650035
## fear 0.04271459 0.17264530
## dead 0.36744901 0.11703558
Fear, soul, hope, hell, help, and dead are the words surrounding god in rock music, which suggests that god is used there more in the context of feelings toward another person than in the general sense of prayers and blessings seen in hip-hop.
Now let's look at the word fight.
Hip-Hop
plot_neighbors('fight', n = 10, tvectors = hiphop_lsa, method = 'MDS',
dims = 2)
## x y
## fight 0.36371805 -0.30704761
## dont -0.08884576 0.01431749
## back -0.10621436 -0.03111350
## get -0.07724950 0.05871461
## power 0.45401446 0.24348586
## right -0.12914871 -0.20502861
## just -0.10095056 0.04593225
## like -0.07859474 0.07558688
## caus -0.10794463 0.06123524
## know -0.12878425 0.04391740
Rock
plot_neighbors('fight', n = 10, tvectors = rock_lsa, method = 'MDS',
dims = 2)
## x y
## fight 0.011616849 -0.118284455
## side -0.003458095 -0.068949821
## stand 0.258603923 -0.073485971
## right -0.339230686 0.005197094
## line 0.155664208 0.105540440
## fear 0.213181366 -0.342559723
## work 0.131140913 0.307133393
## what -0.161678221 0.026009596
## wrong -0.405077866 -0.037173196
## put 0.139237610 0.196572641
The word fight is surrounded by power, back, and dont in hip-hop, indicating that it is used in its literal sense of fighting with someone, as in fight back or dont fight. In rock music, however, fight is surrounded by side, stand, line, right, and wrong, which indicates that it is not used literally but rather in the sense of fighting for someone or having them stand by your side.
We have now seen how words like love, god, and fight are used in different contexts in two different genres of music.
Let's look at the cosine similarities between these words in both of our vector spaces.
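The cosine similarity that multicos() reports for two word vectors u and v is

$$ \cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} $$

which ranges from -1 to 1, with larger values meaning the two words occupy more similar positions in the semantic space.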
list1 <- c('god', 'love', 'fight')
print('Hip-Hop')
## [1] "Hip-Hop"
multicos(list1, tvectors = hiphop_lsa)
## god love fight
## god 1.0000000 0.2754689 0.2501409
## love 0.2754689 1.0000000 0.2630621
## fight 0.2501409 0.2630621 1.0000000
print('Rock')
## [1] "Rock"
multicos(list1, tvectors = rock_lsa)
## god love fight
## god 1.0000000 0.1807514 0.1911263
## love 0.1807514 1.0000000 0.2385054
## fight 0.1911263 0.2385054 1.0000000
We can see that in hip-hop music god has a higher cosine similarity with love than fight whereas in rock music fight has a higher cosine similarity with love than god.
Summarize the results from your study in as plain of language as possible. How does this relate to previous literature? Were the results supportive of your hypotheses? What have we learned from you doing this analysis/study?
The wordclouds showed how much overlap rock and pop share, which helped us see how similar these genres are in terms of the words used in their lyrics. We also saw how the usage of certain words has changed over time, especially compared with the 80s: the 80s had a much higher frequency of negative and fearful words than we see now, and the decline started in the 90s. We also saw how words like god, love, and fight are surrounded by different sets of words in the rock and hip-hop genres. Love was used to convey feelings in rock music, whereas in hip-hop it appeared alongside other hip-hop slang. God was used in its everyday sense, with prayer and blessings, in hip-hop, whereas in rock it appeared with soul and hope, indicating it was being used to convey feelings toward a person. Overall, our analysis supports the theories stated in our problem statement.
Many previous studies of music have focused on sentiment analysis; in fact, many have gone further than we have by computing sentiment scores and extracting the top songs for each sentiment. However, we have taken a slightly different route from traditional music analytics studies by also examining how the same words can be represented differently in two different genres of music.
There is definitely a lot more to explore with this data, such as how the proportions of sentiments changed over time for each genre. That could also tell us which genres use more or fewer negative or positive words now than before. Beyond sentiment words, individual words such as the n word could be explored further to see how their use has changed over time, which artists have used them most over the years, and how that has shifted.
Include your references in APA style.
Getting information about different music genres: “The Genealogy and History of Popular Music Genres.” Musicmap, musicmap.info/.
Rock vs Pop: Kivumbi. “Difference Between.” Difference Between Similar Terms and Objects, 20 Feb. 2011, www.differencebetween.net/miscellaneous/difference-between-rock-and-pop/.
Dataset: Gyanendra Mishra. “380,000+ Lyrics from MetroLyrics.” Kaggle, 11 Jan. 2017, www.kaggle.com/gyani95/380000-lyrics-from-metrolyrics.
Song year:
any-where remix dru-hill “Shyne Discography.” Wikipedia, Wikimedia Foundation, 17 May 2018, en.wikipedia.org/wiki/Shyne_discography.
come-see-me-remix black rob “112 (Ft. Black Rob) – Come See Me (Remix).” Genius, 21 Oct. 1996, genius.com/112-come-see-me-remix-lyrics.
it-s-over-now-remix g-dep “112 - It’s Over Now (Remixes).” Discogs, www.discogs.com/112-Its-Over-Now-Remixes/release/1637082.
star clipse “702 (Ft. Clipse) – Star.” Genius, 10 Dec. 2002, genius.com/702-star-lyrics.