DACSS 601 Final
The dataset for my final project comes from Kaggle and contains all of the songs and lyrics from Taylor Swift’s discography up until 2017. Because each album contains 20+ songs, I will begin by comparing the lyrics of the first album, ‘Taylor Swift’, with the most recent album in the data set, ‘reputation’. Later in my research, I will introduce the middle albums in order to further emphasize the trends across the discography and the themes within each album individually.
Music is often considered one of the universal languages. It can be found in every culture, comes in many different forms, and has evolved drastically over time. It is a safe assumption that most people across the world listen to some form of music in their day-to-day lives, though what they listen to varies widely. Focusing here on the United States, musicians are treated as high-profile celebrities, reaching or surpassing the popularity of actors or even politicians. According to the U.S. Bureau of Labor Statistics, roughly 25,000 musicians and singers were actively working as of May 2021, across settings including schools, recording studios, and various religious organizations. Over the course of a career as a music professional, change and growth within the music are to be expected. Not every artist wants to sing the same type of music forever, and in more saturated genres there is pressure to differentiate from the norm in order to stand out, which is especially important for independent and smaller artists.
My research focuses on Taylor Swift, someone who started off as a country singer before making a smooth and relatively favorable transition to mainstream pop music, achieving success in both genres. Why the switch was made is not for me to determine. However, we can look at the evolution between the first few albums and the last few with respect to the words used in their lyrics.
#This is a preview of the data set and its respective columns.
head(Swift_lyrics)
# A tibble: 6 x 7
artist album track_title track_n lyric line year
<chr> <chr> <chr> <dbl> <chr> <dbl> <dbl>
1 Taylor Swift Taylor Swift Tim McGraw 1 "He said ~ 1 2006
2 Taylor Swift Taylor Swift Tim McGraw 1 "Put thos~ 2 2006
3 Taylor Swift Taylor Swift Tim McGraw 1 "I said, ~ 3 2006
4 Taylor Swift Taylor Swift Tim McGraw 1 "Just a b~ 4 2006
5 Taylor Swift Taylor Swift Tim McGraw 1 "That had~ 5 2006
6 Taylor Swift Taylor Swift Tim McGraw 1 "On backr~ 6 2006
rmarkdown::paged_table(Swift_lyrics)
The variables within the data set include:
colnames(Swift_lyrics)
[1] "artist" "album" "track_title" "track_n"
[5] "lyric" "line" "year"
Given the size of the data set, there are many elements that do not need to be included in our research. For example, common function words in the lyrics such as “I”, “a”, “we”, and “the” can be excluded, as those words would clearly be the most frequent in any song simply because of how often they appear when forming sentences. These words also reveal nothing about main themes or overall growth.
At this stage, I used tokenization to separate the ‘lyric’ column into individual words, in order to find trends within the lyrics. Since the data set contains multiple albums with 10+ tracks each, my research initially filters out the middle albums to show the progression between the first studio album and the most recent one.
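The tokenized data frame used in the later chunks (tidy_lyric) is not created in the code shown here; a minimal sketch of that step, assuming the tidytext package and the column names shown above, might look like this:
library(dplyr)
library(tidytext)
#This splits each lyric line into one word per row; unnest_tokens also
#lower-cases the words and strips punctuation by default.
tidy_lyric <- Swift_lyrics %>%
  unnest_tokens(word, lyric)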
The visualizations shown below can help the average reader understand what I am trying to show in my research. It is one thing to type out the most popular words in the songs, but it is a completely different thing to show which words are the most popular, how the words appear in comparison to other albums, and what can be drawn from it.
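The word_count object used in the next chunk is also not built in the code shown; a plausible sketch, assuming it simply counts the tokenized words per track (the grouping and the num_words name are my assumptions, matching the plot’s x-axis), would be:
#Counting how many words each track contains, before stop words are removed.
word_count <- tidy_lyric %>%
  group_by(album, track_title) %>%
  summarise(num_words = n(), .groups = "drop")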
word_count %>%
ggplot() +
geom_histogram(aes(x = num_words), fill = "#a458c4") +
labs(title ="Word Count Overall", y="# of Songs", x="Words Per Song")
Here I am visualizing the word count per song, which will help me dissect the lyrics and word types within the songs later in the project. At this point we can see that word counts per song tend to fall within the 300-500 range. With this information, we can see how many potential words we are working with, how often the words within these songs are repeated, and how the words may change.
tidy_tslyric <- tidy_lyric %>%
anti_join(stop_words) %>%
filter(nchar(word) > 3)
#This keeps only words longer than three characters, removing the remaining articles and other short words.
tidy_tslyric %>%
count(word, sort = TRUE) %>%
top_n(15) %>%
ungroup() %>%
mutate(word = reorder(word,n)) %>%
ggplot() +
geom_col(aes(word, n), fill = "#FFDAB8") +
labs(title="Frequently Used Words", x="Words", y="Times Used Overall") +
coord_flip() +
theme_minimal()
#This represents the words, on the y axis, with how many times they are said, on the x axis.
Here, I am visualizing the frequency of words across the discography and which ones appear the most. The small chunk of code at the top filters out stop words such as “I”, “a”, and “the”, which are unimportant for this research. Even with no tidying at all, we could reasonably conclude that these short, common words would appear most frequently; I am interested in every word besides them.
A word cloud is a type of visualization that shows the frequency of words within a given set, with the most frequent words appearing larger than the words that are used the least.
select(tidy_tslyric, word)
# A tibble: 8,383 x 1
word
<chr>
1 blue
2 eyes
3 shined
4 georgia
5 stars
6 shame
7 night
8 chevy
9 truck
10 tendency
# ... with 8,373 more rows
# This shows us all of the words used across all of her songs, starting from the first album.
docs <- Corpus(VectorSource(tidy_tslyric$word))
toSpace = content_transformer(
function (x, pattern)
gsub(pattern, " ", x))
docs1 = tm_map(docs, toSpace, "/")
docs1 = tm_map(docs1, toSpace, "#")
dtm <- TermDocumentMatrix(docs1)
matrix <- as.matrix(dtm)
f <- sort(rowSums(matrix), decreasing = TRUE)
df_ts <- data.frame(word = names(f), freq=f)
# The size of each of the words represents how often it shows up in the lyrics.
wordcloud(words = df_ts$word,
freq = df_ts$freq,
min.freq = 5,
max.words = 150,
random.order = FALSE,
rot.per = 0.45,
colors = brewer.pal(8, "Pastel2"))
#This shows us a clearer, numerical version of the most frequently used words and how often they appear.
rmarkdown::paged_table(df_ts)
Here, I am using the data frame from the previous cleaning to show a different way of visualizing the most frequently used words. The table shows exactly how many times each word is used within the songs, which matches the previous visualization and highlights the large jump between the most frequent words and the rest.
#I am creating the data frames for each of the albums and their corresponding words.
albums_compare <- select(tidy_tslyric, album, word)
row.names(albums_compare) <- paste("row", 1:nrow(albums_compare))
album1 <- filter(albums_compare, album == "Taylor Swift")
album2 <- filter(albums_compare, album == "Fearless")
album3 <- filter(albums_compare, album == "Speak Now")
album4 <- filter(albums_compare, album == "Red")
album5 <- filter(albums_compare, album == "1989")
album6 <- filter(albums_compare, album == "reputation")
Since analyzing each word individually would take too much time, we can use facet wraps and grids to compare multiple variables in one space, which allows for easier comparison of the overall trends (in this case, words and how frequently they are used).
This section contains all of the albums featured, for more clarity about the trends over time. Each album is tied to a year, to show the changes (or lack thereof) over time.
popwords <- tidy_tslyric %>%
group_by(album)%>%
count(word, album, sort = TRUE) %>%
slice(seq_len(10)) %>%
ungroup() %>%
arrange(album,n) %>%
mutate(row = row_number())
library(ggplot2)
popwords %>%
ggplot() +
geom_col(aes(word, n), fill = "#83CC94", alpha =.5) +
labs(x = "Words", y = "# Word is Sung", subtitle = "the longer the bar, the most frequently the word is used") +
ggtitle("Popular Words by Album") +
facet_wrap(~album, scales = "free") +
coord_flip() +
theme_classic()
# Album 1: Taylor Swift- Most Frequent Words
docs <- Corpus(VectorSource(album1$word))
toSpace = content_transformer(
function (x, pattern)
gsub(pattern, " ", x))
docs1 = tm_map(docs, toSpace, "/")
docs1 = tm_map(docs1, toSpace, "#")
dtm <- TermDocumentMatrix(docs1)
matrix <- as.matrix(dtm)
f <- sort(rowSums(matrix), decreasing = TRUE)
df_album1 <- data.frame(word = names(f), freq=f)
wordcloud(words = df_album1$word,
freq = df_album1$freq,
min.freq = 1,
max.words = 50,
random.order = FALSE,
rot.per = 0.45,
colors = brewer.pal(8, "Pastel2"))
rmarkdown::paged_table(df_album1)
The table here shows the frequency of the words in the album “Taylor Swift”, with the word cloud to match.
#Album 2: Fearless- Most Frequent Words
docs <- Corpus(VectorSource(album2$word))
toSpace = content_transformer(
function (x, pattern)
gsub(pattern, " ", x))
docs1 = tm_map(docs, toSpace, "/")
docs1 = tm_map(docs1, toSpace, "#")
dtm <- TermDocumentMatrix(docs1)
matrix <- as.matrix(dtm)
f <- sort(rowSums(matrix), decreasing = TRUE)
df_album2 <- data.frame(word = names(f), freq=f)
wordcloud(words = df_album2$word,
freq = df_album2$freq,
min.freq = 1,
max.words = 50,
random.order = FALSE,
rot.per = 0.45,
scale=c(4,0.50),
colors = brewer.pal(8, "Pastel2"))
rmarkdown::paged_table(df_album2)
The table here shows the frequency of the words in the album “Fearless”, with the word cloud to match.
#Album 3: Speak Now- Most Frequent Words
docs <- Corpus(VectorSource(album3$word))
toSpace = content_transformer(
function (x, pattern)
gsub(pattern, " ", x))
docs1 = tm_map(docs, toSpace, "/")
docs1 = tm_map(docs1, toSpace, "#")
dtm <- TermDocumentMatrix(docs1)
matrix <- as.matrix(dtm)
f <- sort(rowSums(matrix), decreasing = TRUE)
df_album3 <- data.frame(word = names(f), freq=f)
wordcloud(words = df_album3$word,
freq = df_album3$freq,
min.freq = 1,
max.words = 50,
random.order = FALSE,
rot.per = 0.45,
scale=c(4,0.50),
colors = brewer.pal(8, "Pastel2"))
rmarkdown::paged_table(df_album3)
The table here shows the frequency of the words in the album “Speak Now”, with the word cloud to match.
#Album 4: Red- Most Frequent Words
docs <- Corpus(VectorSource(album4$word))
toSpace = content_transformer(
function (x, pattern)
gsub(pattern, " ", x))
docs1 = tm_map(docs, toSpace, "/")
docs1 = tm_map(docs1, toSpace, "#")
dtm <- TermDocumentMatrix(docs1)
matrix <- as.matrix(dtm)
f <- sort(rowSums(matrix), decreasing = TRUE)
df_album4 <- data.frame(word = names(f), freq=f)
wordcloud(words = df_album4$word,
freq = df_album4$freq,
min.freq = 1,
max.words = 50,
random.order = FALSE,
rot.per = 0.45,
colors = brewer.pal(8, "Pastel2"))
rmarkdown::paged_table(df_album4)
The table here shows the frequency of the words in the album “Red”, with the word cloud to match.
#Album 5: 1989- Most Frequent Words
docs <- Corpus(VectorSource(album5$word))
toSpace = content_transformer(
function (x, pattern)
gsub(pattern, " ", x))
docs1 = tm_map(docs, toSpace, "/")
docs1 = tm_map(docs1, toSpace, "#")
dtm <- TermDocumentMatrix(docs1)
matrix <- as.matrix(dtm)
f <- sort(rowSums(matrix), decreasing = TRUE)
df_album5 <- data.frame(word = names(f), freq=f)
wordcloud(words = df_album5$word,
freq = df_album5$freq,
min.freq = 1,
max.words = 50,
random.order = FALSE,
rot.per = 0.45,
scale=c(4,1.75),
colors = brewer.pal(8, "Pastel2"))
rmarkdown::paged_table(df_album5)
The table here shows the frequency of the words in the album “1989”, with the word cloud to match.
#Album 6: reputation- Most Frequent Words
docs <- Corpus(VectorSource(album6$word))
toSpace = content_transformer(
function (x, pattern)
gsub(pattern, " ", x))
docs1 = tm_map(docs, toSpace, "/")
docs1 = tm_map(docs1, toSpace, "#")
dtm <- TermDocumentMatrix(docs1)
matrix <- as.matrix(dtm)
f <- sort(rowSums(matrix), decreasing = TRUE)
df_album6 <- data.frame(word = names(f), freq=f)
wordcloud(words = df_album6$word,
freq = df_album6$freq,
min.freq = 1,
max.words = 50,
random.order = FALSE,
rot.per = 0.35,
scale=c(4,1.75),
colors = brewer.pal(8, "Pastel2"))
rmarkdown::paged_table(df_album6)
The table here shows the frequency of the words in the album “reputation”, with the word cloud to match.
This process has been quite the learning experience for me. With the help of the tutorials provided in the Google Classroom, the first few assignments within the project were much easier for me to understand. The difficulties came along once my ambitions got the better of me. I strongly preferred being able to work and learn with RStudio on our own, rather than having multiple weekly assignments dictating our progress with the software. By having more time to explore the system ourselves, I felt more comfortable asking for help and even more confident in my ability to learn through trial and error when first learning how to use ggplot.
I also really enjoyed the freedom of choosing our own data sets when first introduced to the project. I was graced with the opportunity to choose something of my own interest, which conveniently kept me invested in what I was researching. The size of my dataset did not prove to be as big of an issue as I thought it would be, although I do wish I had searched harder for one that contained all of the albums. My classmates, via the class blog, and the internet also proved to be helpful along the way, as the frustrations I experienced with coding are universal both inside and outside our classroom. Revisiting the tutorials and chapters was also key to my success, as it allowed me to compare the strings and vectors I was creating with textbook examples. These examples, along with my classmates’ work, were treated as models going forward from HW1. I also appreciated Jason O’Connell’s example and solution for code folding, which made my assignments look that much neater. Although I wish I had utilized the Slack server more, the questions and comments within our group proved very useful in times of need.
My biggest challenges came from the type of data I was using. Most of my peers chose numerical data to analyze from the get-go, and much of my work involved turning my data into something numeric (i.e., separating the lyrics into individual words and then creating a column for word frequencies). In some instances, I looked to the internet for solutions on how to better shape my work into something more digestible, as simply placing a bunch of words on a graph would not answer my initial research questions. The visualizations remain the hardest part of the process to date, and at some points, I can humbly admit I bit off more than I could chew. Not only did I discover (and learn via tutorials) how to tokenize my data into a form the software could graph, I also came across a myriad of problems that never occurred to me before starting the assignment, specifically eliminating common English words and articles like “I”, “a”, and “the” from my analysis of her most used words. In spite of all this, I would not take anything back. I think it was important for me to push myself outside of my comfort zone, regardless of the errors and mishaps that occurred, and are still occurring.
What questions are left unanswered?
I think that at this point, the distinction between the albums is not yet clear. Due to various struggles with the visualizations, the earlier ggplots reference her discography as a whole rather than proceeding album by album. As I advance my skills further, I will be able to separate the data by album and re-convey the same information shown above with a tighter focus; one possible approach is sketched below.
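One way to make that side-by-side comparison, sketched here with the album1 and album6 data frames built earlier (the object name first_vs_last and the exact presentation are my own assumptions), would be to count words per album and plot the two albums as facets:
library(dplyr)
library(ggplot2)
#Stack the first and most recent albums, count each word within its album,
#and keep the ten most frequent words per album.
first_vs_last <- bind_rows(album1, album6) %>%
  count(album, word, sort = TRUE) %>%
  group_by(album) %>%
  slice_max(n, n = 10) %>%
  ungroup()
ggplot(first_vs_last) +
  geom_col(aes(x = reorder(word, n), y = n), fill = "#a458c4") +
  facet_wrap(~album, scales = "free_y") +
  coord_flip() +
  labs(title = "Top Words: First Album vs. Most Recent Album",
       x = "Words", y = "Times Used")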
Since asking this first question, I have been better able to manipulate the data and separate the data frame by album. However, my ability to compare the albums side by side is still a work in progress. If I were to continue the project further, one route I would take is analyzing the albums in terms of their sentiment; an example can be found here: https://www.tidytextmining.com/sentiment.html
However, at this point in my coding career, being very new to RStudio, I would be hesitant to attempt something of this nature so early on. The thought did occur to me during the process, but I am not confident in my ability to successfully replicate the example without significant guidance.
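For completeness, a minimal sketch of that sentiment route, following the tidytext approach linked above (my assumption here is that the “bing” lexicon would be the one used), could look like this:
library(dplyr)
library(tidytext)
#Label each word as positive or negative using the Bing lexicon,
#then tally sentiment counts per album.
album_sentiment <- tidy_tslyric %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(album, sentiment)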
What might be unclear?
I think most people will be able to digest the data as is, but those who are unfamiliar with the artist might not care about the importance of the words or the frequency with which they appear in songs. I figured a visualization such as a word cloud would be easier to digest than something like a box plot, which would present my data less clearly.
How could I improve the visualizations?
I think the spacing of the graphs and the overall organization could be much improved. I am still working on the y-axis scales to make them look cleaner and more easily understandable, but at this point in the project, I am working with the defaults.
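As one example of moving past those defaults (the specific bin width and axis breaks here are purely illustrative assumptions), the earlier histogram could set its own breaks and a lighter theme:
library(ggplot2)
#The same histogram as before, but with an explicit bin width,
#labelled y-axis breaks, and a lighter theme.
word_count %>%
  ggplot() +
  geom_histogram(aes(x = num_words), fill = "#a458c4", binwidth = 50) +
  scale_y_continuous(breaks = seq(0, 20, by = 2)) +
  labs(title = "Word Count Overall", y = "# of Songs", x = "Words Per Song") +
  theme_minimal()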
From the research I was able to gather and analyze, I think we can conclude that there is growth between the albums in terms of word usage, but we cannot definitively conclude that the growth is significant. We see many of the same words used over time, and even without the context of the lyrics, we can reasonably conclude that the emotions conveyed through the songs have stayed consistent over time.
Another tedious issue I encountered was my constant creation of new data frames. In addition to programming, I tend to hand-write a lot of my thoughts and sequences, but even so, the number of data frames I have made can be hard to follow and keep track of. I hope that as I continue in the DACSS classes, I learn either how to keep track of them all or how to create fewer and reuse the ones I already have; one possible approach is sketched below.
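On that point: the six album chunks above all repeat the same corpus-to-frequency pipeline, so one way to keep fewer objects around (a sketch, with word_freqs as my own placeholder name) would be to wrap that pipeline in a function and call it once per album:
library(tm)
#Turn a vector of words into a word/frequency data frame,
#mirroring the Corpus -> TermDocumentMatrix steps used above.
word_freqs <- function(words) {
  docs <- Corpus(VectorSource(words))
  dtm <- TermDocumentMatrix(docs)
  f <- sort(rowSums(as.matrix(dtm)), decreasing = TRUE)
  data.frame(word = names(f), freq = f)
}
#The same per-album frequency tables as before, without repeating the pipeline.
df_album1 <- word_freqs(album1$word)
df_album6 <- word_freqs(album6$word)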
- “27-2042 Musicians and Singers.” U.S. Bureau of Labor Statistics, U.S. Bureau of Labor Statistics, 31 Mar. 2022, https://www.bls.gov/oes/current/oes272042.htm?msclkid=04de54f5cefb11ec939659c9e6f4a2fc.
- Angier, Jess. “Taylor Swift Lyric - Exploration.” Kaggle, Kaggle, 5 Feb. 2019, https://www.kaggle.com/code/jangier/taylor-swift-lyric-exploration.
- Grolemund, Garrett, and Hadley Wickham. R for Data Science, https://r4ds.had.co.nz/index.html.
- R Markdown Cheat Sheet. RStudio, https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf.
- “RColorBrewer Palettes.” Applied R Code, Dec. 2013, http://applied-r.com/rcolorbrewer-palettes/.
- Singh, Deepika. “Visualization of Text Data Using WordCloud.” Pluralsight, 23 Aug. 2019, https://www.pluralsight.com/guides/visualization-text-data-using-word-cloud-r.
- “Student Submissions- DACSS 601 Spring 2022 Class Blog.” Data Analytics and Computational Social Science, 25 Jan. 2022, https://dacss.github.io/DACSS601Spring2022/.
- R packages cited via citation() in R: readr, ggplot2, tidyverse, tidyselect, dplyr, stringr, wordcloud, wordcloud2, RColorBrewer, and tm.