DACSS 601 Final
The dataset for my final project comes from Kaggle and contains all of the songs and lyrics from Taylor Swift’s discography up until 2017. Because each album contains 20+ songs, I will begin by comparing the lyrics of the first album, ‘Taylor Swift’, with the most recent album in the data set, ‘reputation’. Later in my research, I will introduce the middle albums in order to further emphasize the trends across the discography and the themes within each album individually.
Music is often considered one of the universal languages. It can be found in every culture, comes in many different forms, and has evolved drastically over time. It is a safe assumption that most people across the world listen to some form of music in their day-to-day lives, though what they listen to varies widely. Focusing here on the United States, musicians are treated as high-profile celebrities, reaching or surpassing the popularity of actors or even politicians. According to the U.S. Bureau of Labor Statistics, roughly 25,000 musicians and singers were actively working as of May 2021, across settings including schools, recording studios, and various religious organizations. Over the course of a career as a music professional, change and growth within the music are to be expected. Not every artist wants to sing the same type of music forever, and in more saturated genres there is pressure to differentiate from the norm in order to stand out, which is especially important for independent and smaller artists.
My research focuses on Taylor Swift, someone who started off as a country singer before making a smooth and relatively favorable transition to mainstream pop music, achieving success in both genres. Why the switch was made is not for me to determine. However, we can look at the evolution between the first few albums and the last few with respect to the words used in their lyrics.
#This is a preview of the data set and its respective columns.
head(Swift_lyrics)
# A tibble: 6 x 7
artist album track_title track_n lyric line year
<chr> <chr> <chr> <dbl> <chr> <dbl> <dbl>
1 Taylor Swift Taylor Swift Tim McGraw 1 "He said ~ 1 2006
2 Taylor Swift Taylor Swift Tim McGraw 1 "Put thos~ 2 2006
3 Taylor Swift Taylor Swift Tim McGraw 1 "I said, ~ 3 2006
4 Taylor Swift Taylor Swift Tim McGraw 1 "Just a b~ 4 2006
5 Taylor Swift Taylor Swift Tim McGraw 1 "That had~ 5 2006
6 Taylor Swift Taylor Swift Tim McGraw 1 "On backr~ 6 2006
rmarkdown::paged_table(Swift_lyrics)
The variables within the data set include:
colnames(Swift_lyrics)
[1] "artist" "album" "track_title" "track_n"
[5] "lyric" "line" "year"
Given the size of the data set, there are many elements that do not need to be included in our research. For example, common function words in the lyrics such as “I”, “a”, “we”, and “the” can be excluded, as those words would clearly be the most frequent in any song simply because of how often they appear when forming sentences. These words also reveal nothing about main themes or overall growth.
At this stage, I used tokenization to separate the ‘lyric’ column into individual words, in order to find trends within the lyrics. Since the data set contains multiple albums with 10+ tracks each, my research initially filters out the middle albums to show the progression between the first studio album and the most recent one.
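The tokenized data frame used in the later chunks (tidy_lyric) is not created in the code shown here; a minimal sketch of that step, assuming the tidytext package and the column names shown above, might look like this:
library(dplyr)
library(tidytext)
#This splits each lyric line into one word per row; unnest_tokens also
#lower-cases the words and strips punctuation by default.
tidy_lyric <- Swift_lyrics %>%
  unnest_tokens(word, lyric)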
The visualizations shown below can help the average reader understand what I am trying to show in my research. It is one thing to type out the most popular words in the songs, but it is a completely different thing to show which words are the most popular, how the words appear in comparison to other albums, and what can be drawn from it.
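The word_count object used in the next chunk is also not built in the code shown; a plausible sketch, assuming it simply counts the tokenized words per track (the grouping and the num_words name are my assumptions, matching the plot’s x-axis), would be:
#Counting how many words each track contains, before stop words are removed.
word_count <- tidy_lyric %>%
  group_by(album, track_title) %>%
  summarise(num_words = n(), .groups = "drop")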
word_count %>%
ggplot() +
geom_histogram(aes(x = num_words), fill = "#a458c4") +
labs(title ="Word Count Overall", y="# of Songs", x="Words Per Song")
Here I am visualizing the word count per song, which will help me dissect the lyrics and word types within the songs later in the project. At this point we can see that word counts per song tend to fall within the 300-500 range. With this information, we can see how many potential words we are working with, how often the words within these songs are repeated, and how the words may change.
tidy_tslyric <- tidy_lyric %>%
anti_join(stop_words) %>%
filter(nchar(word) > 3)
#This keeps only words longer than three characters, removing the remaining articles and other short words.
tidy_tslyric %>%
count(word, sort = TRUE) %>%
top_n(15) %>%
ungroup() %>%
mutate(word = reorder(word,n)) %>%
ggplot() +
geom_col(aes(word, n), fill = "#FFDAB8") +
labs(title="Frequently Used Words", x="Words", y="Times Used Overall") +
coord_flip() +
theme_minimal()
#This represents the words, on the y axis, with how many times they are said, on the x axis.
Here, I am visualizing the frequency of words across the discography and which ones appear the most. The small chunk of code at the top filters out stop words such as “I”, “a”, and “the”, which are unimportant for this research. Even with no tidying at all, we could reasonably conclude that these short, common words would appear most frequently; I am interested in every word besides them.
A word cloud is a type of visualization that shows the frequency of words within a given set, with the most frequent words appearing larger than the words that are used the least.
select(tidy_tslyric, word)
# A tibble: 8,383 x 1
word
<chr>
1 blue
2 eyes
3 shined
4 georgia
5 stars
6 shame
7 night
8 chevy
9 truck
10 tendency
# ... with 8,373 more rows
# This shows us all of the words used across all of her songs, starting from the first album.
docs <- Corpus(VectorSource(tidy_tslyric$word))
toSpace = content_transformer(
function (x, pattern)
gsub(pattern, " ", x))
docs1 = tm_map(docs, toSpace, "/")
docs1 = tm_map(docs1, toSpace, "#")
dtm <- TermDocumentMatrix(docs1)
matrix <- as.matrix(dtm)
f <- sort(rowSums(matrix), decreasing = TRUE)
df_ts <- data.frame(word = names(f), freq=f)
# The size of each of the words represents how often it shows up in the lyrics.
wordcloud(words = df_ts$word,
freq = df_ts$freq,
min.freq = 5,
max.words = 150,
random.order = FALSE,
rot.per = 0.45,
colors = brewer.pal(8, "Pastel2"))
#This shows us a clearer, numerical version of the most frequently used words and how often they appear.
rmarkdown::paged_table(df_ts)
Here, I am using the data frame from the previous cleaning to show a different way of visualizing the most frequently used words. The table shows exactly how many times each word is used within the songs, which matches the previous visualization and highlights the large jump between the most frequent words and the rest.
#I am creating the data frames for each of the albums and their corresponding words.
albums_compare <- select(tidy_tslyric, album, word)
row.names(albums_compare) <- paste("row", 1:nrow(albums_compare))
album1 <- filter(albums_compare, album == "Taylor Swift")
album2 <- filter(albums_compare, album == "Fearless")
album3 <- filter(albums_compare, album == "Speak Now")
album4 <- filter(albums_compare, album == "Red")
album5 <- filter(albums_compare, album == "1989")
album6 <- filter(albums_compare, album == "reputation")
Since analyzing each word individually would take too much time, we can use facet wraps and grids to compare multiple variables in one space, which allows for easier comparison of the overall trends (in this case, words and how frequently they are used).
This section contains all of the albums featured, for more clarity about the trends over time. Each album is tied to a year, to show the changes (or lack thereof) over time.
popwords <- tidy_tslyric %>%
group_by(album)%>%
count(word, album, sort = TRUE) %>%
slice(seq_len(10)) %>%
ungroup() %>%
arrange(album,n) %>%
mutate(row = row_number())
library(ggplot2)
popwords %>%
ggplot() +
geom_col(aes(word, n), fill = "#83CC94", alpha =.5) +
labs(x = "Words", y = "# Word is Sung", subtitle = "the longer the bar, the most frequently the word is used") +
ggtitle("Popular Words by Album") +
facet_wrap(~album, scales = "free") +
coord_flip() +
theme_classic()
# Album 1: Taylor Swift- Most Frequent Words
docs <- Corpus(VectorSource(album1$word))
toSpace = content_transformer(
function (x, pattern)
gsub(pattern, " ", x))
docs1 = tm_map(docs, toSpace, "/")
docs1 = tm_map(docs1, toSpace, "#")
dtm <- TermDocumentMatrix(docs1)
matrix <- as.matrix(dtm)
f <- sort(rowSums(matrix), decreasing = TRUE)
df_album1 <- data.frame(word = names(f), freq=f)
wordcloud(words = df_album1$word,
freq = df_album1$freq,
min.freq = 1,
max.words = 50,
random.order = FALSE,
rot.per = 0.45,
colors = brewer.pal(8, "Pastel2"))
rmarkdown::paged_table(df_album1)
The table here shows the frequency of the words in the album “Taylor Swift”, with the word cloud to match.
#Album 2: Fearless- Most Frequent Words
docs <- Corpus(VectorSource(album2$word))
toSpace = content_transformer(
function (x, pattern)
gsub(pattern, " ", x))
docs1 = tm_map(docs, toSpace, "/")
docs1 = tm_map(docs1, toSpace, "#")
dtm <- TermDocumentMatrix(docs1)
matrix <- as.matrix(dtm)
f <- sort(rowSums(matrix), decreasing = TRUE)
df_album2 <- data.frame(word = names(f), freq=f)
wordcloud(words = df_album2$word,
freq = df_album2$freq,
min.freq = 1,
max.words = 50,
random.order = FALSE,
rot.per = 0.45,
scale=c(4,0.50),
colors = brewer.pal(8, "Pastel2"))
rmarkdown::paged_table(df_album2)
The table here shows the frequency of the words in the album “Fearless”, with the word cloud to match.
#Album 3: Speak Now- Most Frequent Words
docs <- Corpus(VectorSource(album3$word))
toSpace = content_transformer(
function (x, pattern)
gsub(pattern, " ", x))
docs1 = tm_map(docs, toSpace, "/")
docs1 = tm_map(docs1, toSpace, "#")
dtm <- TermDocumentMatrix(docs1)
matrix <- as.matrix(dtm)
f <- sort(rowSums(matrix), decreasing = TRUE)
df_album3 <- data.frame(word = names(f), freq=f)
wordcloud(words = df_album3$word,
freq = df_album3$freq,
min.freq = 1,
max.words = 50,
random.order = FALSE,
rot.per = 0.45,
scale=c(4,0.50),
colors = brewer.pal(8, "Pastel2"))
rmarkdown::paged_table(df_album3)
The table here shows the frequency of the words in the album “Speak Now”, with the word cloud to match.
#Album 4: Red- Most Frequent Words
docs <- Corpus(VectorSource(album4$word))
toSpace = content_transformer(
function (x, pattern)
gsub(pattern, " ", x))
docs1 = tm_map(docs, toSpace, "/")
docs1 = tm_map(docs1, toSpace, "#")
dtm <- TermDocumentMatrix(docs1)
matrix <- as.matrix(dtm)
f <- sort(rowSums(matrix), decreasing = TRUE)
df_album4 <- data.frame(word = names(f), freq=f)
wordcloud(words = df_album4$word,
freq = df_album4$freq,
min.freq = 1,
max.words = 50,
random.order = FALSE,
rot.per = 0.45,
colors = brewer.pal(8, "Pastel2"))
rmarkdown::paged_table(df_album4)
The table here shows the frequency of the words in the album “Red”, with the word cloud to match.
#Album 5: 1989- Most Frequent Words
docs <- Corpus(VectorSource(album5$word))
toSpace = content_transformer(
function (x, pattern)
gsub(pattern, " ", x))
docs1 = tm_map(docs, toSpace, "/")
docs1 = tm_map(docs1, toSpace, "#")
dtm <- TermDocumentMatrix(docs1)
matrix <- as.matrix(dtm)
f <- sort(rowSums(matrix), decreasing = TRUE)
df_album5 <- data.frame(word = names(f), freq=f)
wordcloud(words = df_album5$word,
freq = df_album5$freq,
min.freq = 1,
max.words = 50,
random.order = FALSE,
rot.per = 0.45,
scale=c(4,1.75),
colors = brewer.pal(8, "Pastel2"))
rmarkdown::paged_table(df_album5)
The table here shows the frequency of the words in the album “1989”, with the word cloud to match.
#Album 6: reputation- Most Frequent Words
docs <- Corpus(VectorSource(album6$word))
toSpace = content_transformer(
function (x, pattern)
gsub(pattern, " ", x))
docs1 = tm_map(docs, toSpace, "/")
docs1 = tm_map(docs1, toSpace, "#")
dtm <- TermDocumentMatrix(docs1)
matrix <- as.matrix(dtm)
f <- sort(rowSums(matrix), decreasing = TRUE)
df_album6 <- data.frame(word = names(f), freq=f)
wordcloud(words = df_album6$word,
freq = df_album6$freq,
min.freq = 1,
max.words = 50,
random.order = FALSE,
rot.per = 0.35,
scale=c(4,1.75),
colors = brewer.pal(8, "Pastel2"))
rmarkdown::paged_table(df_album6)
The table here shows the frequency of the words in the album “reputation”, with the word cloud to match.
This process has been quite the learning experience for me. With the help of the tutorials provided in the Google Classroom, the first few assignments within the project were much easier for me to understand. The difficulties came along once my ambitions got the better of me. I strongly preferred being able to work and learn with RStudio on our own, rather than having multiple weekly assignments dictating our progress with the software. By having more time to explore the system ourselves, I felt more comfortable asking for help and even more confident in my ability to learn through trial and error when first learning how to use ggplot.
I also really enjoyed the freedom of choosing our own data sets when first introduced to the project. I was graced with the opportunity to choose something of my own interest, which conveniently kept me invested in what I was researching. The size of my dataset did not prove to be as big of an issue as I thought it would be, although I do wish I had searched harder for one that contained all of the albums. My classmates, via the class blog, and the internet also proved to be helpful along the way, as the frustrations I experienced with coding are universal both inside and outside our classroom. Revisiting the tutorials and chapters was also key to my success, as it allowed me to compare the strings and vectors I was creating with textbook examples. These examples, along with my classmates’ work, were treated as models going forward from HW1. I also appreciated Jason O’Connell’s example and solution for code folding, which made my assignments look that much neater. Although I wish I had utilized the Slack server more, the questions and comments within our group proved very useful in times of need.
My biggest challenges came from the type of data I was using. Most of my peers chose numerical data to analyze from the get-go, and much of my work involved turning my data into something numeric (i.e., separating the lyrics into individual words and then creating a column for word frequencies). In some instances, I looked to the internet for solutions on how to better shape my work into something more digestible, as simply placing a bunch of words on a graph would not answer my initial research questions. The visualizations remain the hardest part of the process to date, and at some points, I can humbly admit I bit off more than I could chew. Not only did I discover (and learn via tutorials) how to tokenize my data into a form the software could graph, I also came across a myriad of problems that never occurred to me before starting the assignment, specifically eliminating common English words and articles like “I”, “a”, and “the” from my analysis of her most used words. In spite of all this, I would not take anything back. I think it was important for me to push myself outside of my comfort zone, regardless of the errors and mishaps that occurred, and are still occurring.
What questions are left unanswered?
I think that at this point, the distinction between the albums is not yet clear. Due to various struggles with the visualizations, the earlier ggplots reference her discography as a whole rather than proceeding album by album. As I advance my skills further, I will be able to separate the data by album and re-convey the same information shown above with a tighter focus; one possible approach is sketched below.
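One way to make that side-by-side comparison, sketched here with the album1 and album6 data frames built earlier (the object name first_vs_last and the exact presentation are my own assumptions), would be to count words per album and plot the two albums as facets:
library(dplyr)
library(ggplot2)
#Stack the first and most recent albums, count each word within its album,
#and keep the ten most frequent words per album.
first_vs_last <- bind_rows(album1, album6) %>%
  count(album, word, sort = TRUE) %>%
  group_by(album) %>%
  slice_max(n, n = 10) %>%
  ungroup()
ggplot(first_vs_last) +
  geom_col(aes(x = reorder(word, n), y = n), fill = "#a458c4") +
  facet_wrap(~album, scales = "free_y") +
  coord_flip() +
  labs(title = "Top Words: First Album vs. Most Recent Album",
       x = "Words", y = "Times Used")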
Since asking this first question, I have been better able to manipulate the data and separate the data frame by album. However, my ability to compare the albums side by side is still a work in progress. If I were to continue the project further, one route I would take is analyzing the albums in terms of their sentiment; an example can be found here: https://www.tidytextmining.com/sentiment.html
However, at this point in my coding career, being very new to RStudio, I would be hesitant to attempt something of this nature so early on. The thought did occur to me during the process, but I am not confident in my ability to successfully replicate the example without significant guidance.
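For completeness, a minimal sketch of that sentiment route, following the tidytext approach linked above (my assumption here is that the “bing” lexicon would be the one used), could look like this:
library(dplyr)
library(tidytext)
#Label each word as positive or negative using the Bing lexicon,
#then tally sentiment counts per album.
album_sentiment <- tidy_tslyric %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(album, sentiment)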
What might be unclear?
I think most people will be able to digest the data as is, but those who are unfamiliar with the artist might not care about the importance of the words or the frequency with which they appear in songs. I figured a visualization such as a word cloud would be easier to digest than something like a box plot, which would present my data less clearly.
How could I improve the visualizations?
I think the spacing of the graphs and the overall organization could be much improved. I am still working on the y-axis scales to make them look cleaner and more easily understandable, but at this point in the project, I am working with the defaults.
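As one example of moving past those defaults (the specific bin width and axis breaks here are purely illustrative assumptions), the earlier histogram could set its own breaks and a lighter theme:
library(ggplot2)
#The same histogram as before, but with an explicit bin width,
#labelled y-axis breaks, and a lighter theme.
word_count %>%
  ggplot() +
  geom_histogram(aes(x = num_words), fill = "#a458c4", binwidth = 50) +
  scale_y_continuous(breaks = seq(0, 20, by = 2)) +
  labs(title = "Word Count Overall", y = "# of Songs", x = "Words Per Song") +
  theme_minimal()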
From the research I was able to gather and analyze, I think we can conclude that there is growth between the albums in terms of word usage, but we cannot definitively conclude that the growth is significant. We see many of the same words used over time, and even without the context of the lyrics, we can reasonably conclude that the emotions conveyed through the songs have stayed consistent over time.
Another tedious issue I encountered was my constant creation of new data frames. In addition to programming, I tend to hand-write a lot of my thoughts and sequences, but even so, the number of data frames I have made can be hard to follow and keep track of. I hope that as I continue in the DACSS classes, I learn either how to keep track of them all or how to create fewer and reuse the ones I already have; one possible approach is sketched below.
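On that point: the six album chunks above all repeat the same corpus-to-frequency pipeline, so one way to keep fewer objects around (a sketch, with word_freqs as my own placeholder name) would be to wrap that pipeline in a function and call it once per album:
library(tm)
#Turn a vector of words into a word/frequency data frame,
#mirroring the Corpus -> TermDocumentMatrix steps used above.
word_freqs <- function(words) {
  docs <- Corpus(VectorSource(words))
  dtm <- TermDocumentMatrix(docs)
  f <- sort(rowSums(as.matrix(dtm)), decreasing = TRUE)
  data.frame(word = names(f), freq = f)
}
#The same per-album frequency tables as before, without repeating the pipeline.
df_album1 <- word_freqs(album1$word)
df_album6 <- word_freqs(album6$word)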
- “27-2042 Musicians and Singers.” U.S. Bureau of Labor Statistics, U.S. Bureau of Labor Statistics, 31 Mar. 2022, https://www.bls.gov/oes/current/oes272042.htm?msclkid=04de54f5cefb11ec939659c9e6f4a2fc.
- Angier, Jess. “Taylor Swift Lyric - Exploration.” Kaggle, Kaggle, 5 Feb. 2019, https://www.kaggle.com/code/jangier/taylor-swift-lyric-exploration.
- Grolemund, Garrett, and Hadley Wickham. R for Data Science, https://r4ds.had.co.nz/index.html.
- R Markdown Cheat Sheet. RStudio, https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf.
- “RColorBrewer Palettes.” Applied R Code, Dec. 2013, http://applied-r.com/rcolorbrewer-palettes/.
- Singh, Deepika. “Visualization of Text Data Using WordCloud.” Pluralsight, 23 Aug. 2019, https://www.pluralsight.com/guides/visualization-text-data-using-word-cloud-r.
- “Student Submissions- DACSS 601 Spring 2022 Class Blog.” Data Analytics and Computational Social Science, 25 Jan. 2022, https://dacss.github.io/DACSS601Spring2022/.
- R packages cited via citation() in R: readr, ggplot2, tidyverse, tidyselect, dplyr, stringr, wordcloud, wordcloud2, RColorBrewer, and tm.