Taylor Swift Data Exploration Project, Take 3

I created a PDF of all the lyrics from Taylor Swift’s Red (Taylor’s Version) album, including all of the From the Vault bonus songs.

The following were stripped out:

All vocalizations that are not words
All notes identifying the singer (for collaborations)
All song part labels (e.g., “verse” or “outro”)

First, I’ll load in the necessary libraries.

library(tidyverse)
library(quanteda)
library(quanteda.textplots)
library(readtext)

# Get the data directory from readtext
DATA_DIR <- system.file("extdata/", package = "readtext")

Next, I’ll read in the pdf.

## Read in lyrics pdf files
tswift <- readtext::readtext("~/DACCS R/Text as Data/TS Data/Red_stripped_down.pdf")

#create corpus
corpus_red <- corpus(tswift)

Now I will attempt to create a wordcloud:

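# Tokenize, dropping punctuation and English stop words, then build the dfm.
# (Passing a corpus and `remove =` straight to dfm() is deprecated in quanteda 3+.)
dfm_red <- tokens(corpus_red, remove_punct = TRUE) %>%
    tokens_remove(stopwords("english")) %>%
    dfm() %>%
    dfm_trim(min_termfreq = 10, verbose = FALSE)
set.seed(100)
textplot_wordcloud(dfm_red)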
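
Before plotting, it can help to check which words will dominate the cloud. Here is a minimal sketch of that check using quanteda's topfeatures(). It runs on made-up toy text rather than the actual lyrics, and the names toy_text, toy_tokens, toy_dfm, and top_terms are mine, for illustration only.

```r
library(quanteda)

# Made-up toy text standing in for the lyrics (not real data)
toy_text <- c("red scarf red door", "the scarf was red")

# Tokenize, drop English stop words, and count terms
toy_tokens <- tokens(toy_text, remove_punct = TRUE)
toy_tokens <- tokens_remove(toy_tokens, stopwords("english"))
toy_dfm <- dfm(toy_tokens)

# Most frequent terms, highest first
top_terms <- topfeatures(toy_dfm, n = 3)
print(top_terms)
```

The same call on the real document-feature matrix should show which words clear the min_termfreq cutoff.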

Trying a different way of doing this:

I’m going to follow a tutorial I found online.

#Load required packages
library(wordcloud)
library(wordcloud2)
library(RColorBrewer)
library(tm)
library(tidyverse)
library(pdftools)

Using pdftools, read in the pdf again and create corpus.

#read in pdf
red_pdf <- pdf_text("/Users/lissie/DACCS R/Text as Data/TS Data/Red_stripped_down.pdf")

#create corpus
corpus_red2 <- Corpus(VectorSource(red_pdf))

Next, we’ll clean the data using tm:

# Lowercase first so the stop words match, then strip numbers,
# punctuation, and stop words; collapse leftover whitespace last.
corpus_red2 <- corpus_red2 %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(stripWhitespace)

Create frequency matrix:

tdm <- TermDocumentMatrix(corpus_red2)   # terms in rows, PDF pages in columns
tdm_matrix <- as.matrix(tdm)
words <- sort(rowSums(tdm_matrix), decreasing = TRUE)   # total count per term
df <- data.frame(word = names(words), freq = words)
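
To make the counting step concrete, here is a minimal base R sketch on a hand-built toy matrix. The counts are made up, and toy_matrix, toy_words, and toy_df are illustrative names. Because pdf_text() returns one string per page, each column of the real matrix is a page of the PDF, and rowSums() adds a word's count across all pages.

```r
# Hand-built toy term-document matrix: rows are terms, columns are pages
# (made-up counts, not the real lyrics)
toy_matrix <- matrix(
  c(2, 1,
    1, 1,
    1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("red", "scarf", "door"), c("page1", "page2"))
)

# Sum each term across pages and sort from most to least frequent
toy_words <- sort(rowSums(toy_matrix), decreasing = TRUE)

# Same two-column layout that the word cloud functions expect
toy_df <- data.frame(word = names(toy_words), freq = toy_words)
print(toy_df)
```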

Create word cloud:

set.seed(1234) # for reproducibility 
wordcloud(words = df$word, freq = df$freq, min.freq = 5,
          max.words = 200, random.order = FALSE, scale = c(3.5, 0.25),
          rot.per = 0.35, colors = brewer.pal(8, "Dark2"))

Or a different version:

wordcloud2(data=df, size=1.6, color='random-dark')

Now I’m going to try to create a word cloud with phrases instead of individual words:

text <- readLines("/Users/lissie/DACCS R/Text as Data/TS Data/red_lyrics.txt")


# freq = 1 adds a column of 1s, one for every line.
my_data <- data.frame(text = text, freq = 1, stringsAsFactors = FALSE)

# aggregate the data.    
my_agr <- aggregate(freq ~ ., data = my_data, sum)

wordcloud(words = my_agr$text, freq = my_agr$freq, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"), scale = c(10, .5))

(Note: many of the lines were too long to fit in the plot, but for the sake of publishing, I hid the warnings.)
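
To show what the aggregation step in the phrase version does, here is a minimal base R sketch on three made-up lines (not the real lyrics file). The names toy_lines, toy_data, and toy_agr are illustrative. Every line starts with a count of 1, and aggregate() sums those counts for identical lines, so a repeated line ends up with a higher frequency.

```r
# Made-up lines standing in for the lyrics file; one line repeats
toy_lines <- c("red scarf", "open door", "red scarf")

# Give every line a starting count of 1
toy_data <- data.frame(text = toy_lines, freq = 1, stringsAsFactors = FALSE)

# Sum the counts for identical lines
toy_agr <- aggregate(freq ~ ., data = toy_data, sum)
print(toy_agr)
```

In the real data, a chorus line that repeats many times should come out much larger in the cloud than a line that appears once.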