Report of Data Visualization

Taylor Swift: the music industry

Taylor Alison Swift (born December 13, 1989) is an American singer-songwriter. Her discography spans multiple genres, and her narrative songwriting, often inspired by her personal life, has received critical praise and widespread media coverage. Born in West Reading, Pennsylvania, Swift was raised on a Christmas tree farm. She relocated to Nashville, Tennessee, at the age of 14 to pursue a career in country music. She signed a recording deal with Big Machine Records in 2005 and released her eponymous debut studio album in 2006.

Eight of her songs have topped the Billboard Hot 100, and her concert tours are some of the highest-grossing in history. She has received 11 Grammy Awards (including three Album of the Year wins), an Emmy Award, 34 American Music Awards (the most for an artist), 25 Billboard Music Awards (the most for a woman) and 56 Guinness World Records, among other accolades. She featured on Rolling Stone’s 100 Greatest Songwriters of All Time (2015), Billboard’s Greatest of All Time Artists (2019), the Time 100 and Forbes Celebrity 100 rankings. Named Woman of the 2010s Decade by Billboard and Artist of the 2010s Decade by the American Music Awards, Swift has been regarded as a pop icon due to her influential career, philanthropy, and advocacy of artists’ rights and women’s empowerment in the music industry.

The aim of this analysis is to take a more in-depth look at her discography and songwriting, with special attention to the results achieved in terms of Spotify streams and pure sales.

The Data

The data sets used in this analysis are:

taylor, which contains characteristics and lyrics of all officially released Taylor Swift songs that are included in her first 9 studio albums. It is composed of 156 observations (the songs) and 34 variables. The ones used in this analysis are:
- album_name: the name of the album;
- album_release: the date the album was released, in the ISO-8601 format (YYYY-MM-DD);
- track_name: the name of the song;
- genre1: the main genre of the song;
- rating: explicit if the song contains explicit lyrics or is about “strong” themes, else it is clean;
- writing: self-written if the song was written by Taylor Swift alone, else co-written;
- streams: the number of streams the song received on Spotify up to 24th April 2022;
- danceability: how suitable a track is for dancing. 0.0 = least danceable, 1.0 = most danceable;
- energy: perceptual measure of intensity and activity. 0.0 = least energy, 1.0 = most energy;
- loudness: loudness of track in decibels (dB), averaged across the track;
- speechiness: the presence of spoken words in a track. Values above 0.66 indicate that the track is probably made entirely of spoken words. Values between 0.33 and 0.66 indicate both music and speech. Values less than 0.33 indicate the track is probably music or other non-speech tracks;
- acousticness: confidence that the track is acoustic. 0.0 = low confidence, 1.0 = high confidence;
- instrumentalness: confidence that the track is an instrumental track (i.e., no vocals). 0.0 = low confidence, 1.0 = high confidence;
- liveness: confidence that the track is a live recording (i.e., an audience is present). 0.0 = low confidence, 1.0 = high confidence;
- valence: musical positiveness conveyed by the track. 0.0 = low valence (e.g., sad, depressed, angry), 1.0 = high valence (e.g., happy, cheerful, euphoric);
- tempo: estimated tempo of the track in beats per minute (BPM);
- lyrics: lyrics of each song.
albumsales, which contains information about the sales of each album in all the countries in all the years. It is made of 8280 rows and 6 variables:
- Album: the name of the album;
- Year: the year in which the sales were registered;
- Continent: the continent of the country in which the sales were registered;
- Country: the country in which the sales were registered;
- Sales: the sales of pure copies of the albums;
- Pure Sales Index: an index capped at 100. It equals 100 if the album sold 100k or more pure units in a country during a given year; otherwise it equals the number of units sold divided by 1000.
album1989dailystreams, which contains information about the streams of the album 1989 in March and April. It consists of 71 rows and 4 columns. The considered variables are the following:
- Week: the number of the week;
- Day_week: the day of the week;
- Daily_streams: the streams that the album received in a given day in a given week.
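
The definition of the Pure Sales Index given for albumsales can be written as a one-line formula. The sketch below is only an illustration: the helper name `pure_sales_index` and the sample sales figures are made up, and the original data set already ships with the index computed.

```r
# Hypothetical helper (not part of the original script): Pure Sales Index
# computed from pure unit sales, following the definition given above.
pure_sales_index <- function(sales) {
  # units sold divided by 1000, capped at 100 once sales reach 100k units
  pmin(sales / 1000, 100)
}

# Made-up sales figures for three country-year combinations
pure_sales_index(c(25000, 100000, 480000))
```

The call returns 25, 100 and 100: only the first figure is below the 100k threshold, so only it is scaled rather than capped.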

General Overview

The musical genres

Taylor Swift has been on the music scene for more than 15 years, so let’s start this analysis by looking at the main musical genres she has explored in her career.

# Packages used throughout the report
library(tidyverse)   # dplyr, ggplot2, tidyr, tibble
library(lubridate)   # year(), ymd()

# Frequency table
freq_table <- taylor %>% 
  group_by(genre1) %>%                      # one group per main genre
  summarize(Abs_freq = n()) %>%             # absolute frequencies
  mutate(Rel_freq = round(Abs_freq / sum(Abs_freq), digits = 4), # relative frequencies
         Perc_freq = Rel_freq * 100) %>%    # percentages
  mutate(Lab_text = paste0(Perc_freq, "%")) # text label for the chart, e.g. "45.51%"

# Custom palette of colors
my_palette = c("#FBD35D", "#75DBB2", "#FF6BAE")

# Donut chart
library(RColorBrewer)
freq_table %>% 
  ggplot(aes(x=1, y=Perc_freq,fill=as.factor(genre1))) +
  geom_bar(stat="identity") + # stacked percentage bar chart using stat="identity" so geom_bar does not compute frequencies but uses the data table value as they are
  geom_text(aes(label = Lab_text),position = position_stack(vjust = 0.5)) +
  xlim(c(-0.5,1.5))+ # donut chart
  coord_polar(theta="y") + 
  labs(fill = "Genre", title = "Taylor Swift's discography", subtitle = "by genres") + 
  scale_fill_manual(values = my_palette) +
  theme_void()

From this donut chart, we can see that 45.51% of Swift’s discography is made of pop songs, 32.69% is made of country songs and the remaining 21.79% is made of alternative songs.
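
The three percentages add up to 99.99% rather than 100% because each share is rounded to four decimals before being converted to a percentage. The sketch below reproduces this with the song counts implied by the percentages and by the 156 songs in the data set (71, 51 and 34); these counts are inferred for illustration and are not printed by the original script.

```r
# Song counts per genre inferred from the percentages and the 156 songs
# (an assumption for illustration, not output of the original script)
abs_freq <- c(pop = 71, country = 51, alternative = 34)

rel_freq  <- round(abs_freq / sum(abs_freq), digits = 4)  # relative frequencies
perc_freq <- rel_freq * 100                               # percentages
perc_freq

# Slightly below 100 because every share was rounded
sum(perc_freq)
```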

Albums and type of writing

Let’s look more closely at which albums are alternative, country or pop, and at how these genres relate to the type of writing.

# Create a variable that contains the name of each album together with its year of release
taylor <- taylor %>% 
  mutate(release_year = year(ymd(album_release)), .after = album_release)
year_parenthesis <- paste0("(", taylor$release_year, ")")
taylor <- taylor %>%
  mutate(album_year = paste(album_name, year_parenthesis, sep = " "), .after = album_name)
taylor$album_year <- as.factor(taylor$album_year)

# Data preparation
data_for_allu <- taylor %>% 
  dplyr::group_by(album_name, album_year, release_year, genre1,rating,writing) %>% 
  dplyr::summarise(Freq=n(),.groups = "drop") %>% 
  arrange(release_year, desc(album_name)) 

# Change order of the levels of album_year
data_for_allu$album_year <- factor(data_for_allu$album_year, levels = unique(data_for_allu$album_year)) 

# Custom palette of colors
my_palette1 = c("#E6AB02", "#1B9E77", "#E7298A")

# Alluvial plot
library(ggalluvial)
ggplot(data_for_allu, aes(y = Freq,
                          axis1 = genre1,
                          axis2 = album_year,
                          axis3 = rating,
                          axis4 = writing)) + 
  geom_alluvium(aes(fill = genre1), show.legend = FALSE,
                width = 1/10, curve_type = "quintic")+ 
  geom_stratum(width = 1/8, 
               fill = "grey30", color = "grey")+
  geom_label(stat = "stratum", aes(label = after_stat(stratum))) +
  scale_x_discrete(limits = c("Genre","Album","Rating", "Writing"), expand = c(.05, .05)) + 
  labs(title = "The albums",
       subtitle = "by genre and writing",
       y="")+
  scale_fill_manual('Genre',values = my_palette1)+
  theme(axis.text.y=element_blank(),
        axis.ticks.y=element_blank(),
        plot.background = element_rect(fill = "white", color = "white"),
        panel.background = element_rect(fill = "white", color = "white"),
        plot.title = element_text(
        size = 15))

We can see that the country albums are Taylor Swift (2006), Fearless (2008) and Speak Now (2010), the pop ones are Red (2012), 1989 (2014), Reputation (2017) and Lover (2019), and the alternative ones are Folklore (2020) and Evermore (2020). A first interesting aspect is that the singer explored each genre over several consecutive albums before switching to the next one, consolidating her command of that genre. Moreover, only in her most recent alternative albums (released when she was 30) does she make frequent use of explicit lyrics, while the country albums are completely clean and only a very small number of pop songs are explicit. All the explicit songs are co-written. One possible reading is that the explicit lyrics come from her co-writers and that she is simply willing to include them in her songs, no longer worrying about the “American good girl” reputation she had in the past; the data alone cannot confirm this.

Spotify features: Principal Component Analysis

Many numerical variables describe Taylor’s songs and her way of making country, pop and alternative music, and looking at all of them at once would be unwieldy. I therefore want to find a smaller number of variables that represent the characteristics of her songs as well as possible without losing too much information. To this end I conducted a Principal Component Analysis (PCA), which creates new variables, called principal components, that are linear combinations of the original variables. First, of course, we must check whether and how these numerical variables are correlated. We can do this with a correlogram.
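
To make the idea of "linear combinations" concrete, here is a minimal sketch on a built-in data set (iris, not the Taylor Swift data). It uses base R's `prcomp()` instead of the FactoMineR function used in this report, purely so that the example needs no extra packages.

```r
# Toy illustration on a built-in data set (not the Taylor Swift data)
X <- scale(iris[, 1:4])   # centre and standardise the four variables
pca <- prcomp(X)          # principal component analysis

# Each principal component score is a linear combination of the
# standardised variables, weighted by the loadings in pca$rotation
scores_by_hand <- X %*% pca$rotation

# The hand-computed scores agree with the ones returned by prcomp()
all.equal(unname(scores_by_hand), unname(pca$x))
```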

Correlogram

# Data preparation
taylor_for_pca <- taylor %>%
  select(danceability,energy,loudness,speechiness,acousticness,instrumentalness,liveness,tempo,valence)

# Correlation matrix
correlations<-as.matrix(round(cor(taylor_for_pca),2)) 
colors = c("#92573C", "#C6B69C", "#D8D8CF", "#846578", "#5D4E5D")

# Correlogram
library(corrplot)
corrplot.mixed(correlations, order = "hclust",
               lower = "number", lower.col = "#800080", number.cex = 0.80,
               upper = "ellipse", upper.col = colors,
               diag = "u", tl.pos = "lt", tl.col = "black")

From the correlogram, the positive correlations that stand out the most are those between:

energy and loudness;
danceability and valence.

Instead, the negative correlations that stand out the most are those between:

acousticness and energy/loudness;
danceability and speechiness/instrumentalness.

Danceability and Valence

Since we observe a high correlation of 0.78 between danceability and valence, let’s take a more in-depth look at it and also show the most common values of these two variables in Taylor’s songs.

library(viridis)
# 2D Kernel Density Estimate
taylor %>% ggplot(aes(x=danceability,y=valence))+
  stat_density_2d(aes(fill = after_stat(density)), # after_stat() replaces the deprecated ..density.. notation
                  geom = "raster", 
                  contour = FALSE)+
  theme_minimal()+
  xlab("Danceability")+
  ylab("Valence")+
  labs(title="Correlation between Danceability and Valence", subtitle="in Taylor Swift discography")+
  scale_fill_viridis("Density",option = "magma", limits = c(0,15))

This plot displays more clearly the positive relationship between the two variables observed in the correlation matrix: as danceability increases, valence tends to increase too (and vice versa). Moreover, the plot makes clear that most of Taylor’s songs have fairly high values of both variables, between 0.5 and 0.7.

The fact that these two variables are very highly correlated means that they could be both explained by the same component, but let’s proceed step by step.

Scree plot

The first step is to select the dimensions (the PCs) for visualizing the data. We can do this by analyzing the so-called scree plot, i.e. the bar plot of the percentage of variability explained by each new direction (which is related to the eigenvalues), ordered from largest to smallest.

# PCA results
library(data.table)
library(tibble) # rownames_to_column()
PCA_res <- FactoMineR::PCA(taylor_for_pca, graph = FALSE)
knitr::kable(rownames_to_column(as.data.frame(PCA_res$eig), var = "Eig."))

Eig.	eigenvalue	percentage of variance	cumulative percentage of variance
comp 1	2.8203185	31.336872	31.33687
comp 2	2.4384453	27.093836	58.43071
comp 3	0.9473284	10.525872	68.95658
comp 4	0.9318970	10.354411	79.31099
comp 5	0.6646716	7.385240	86.69623
comp 6	0.4633475	5.148306	91.84454
comp 7	0.3579932	3.977702	95.82224
comp 8	0.2111140	2.345712	98.16795
comp 9	0.1648844	1.832049	100.00000

# Screeplot
library(factoextra)
library(plotly)
ggplotly(fviz_screeplot(PCA_res, ncp=9, barfill="#841283", barcolor = "white"))

In particular, the first principal component accounts for 31.34% of the variability, the second for 27.09%, and so on. Originally we had 9 variables; after standardization each carries 1/9, or about 11%, of the total variability, so any pair of original variables carries about 22%. The first two principal components together account for 58.43%, which is more than two and a half times that share and more than half of the total variability. Following the elbow rule, we can therefore retain only these two components. A projection of the data onto the first two PCs gives us a good way to visualize the data in a low-dimensional linear subspace.
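
The percentages in the table can be recomputed from the eigenvalues alone, which is a useful sanity check. The sketch below copies the eigenvalues from the table above; the variable names are chosen here for illustration.

```r
# Eigenvalues copied from the PCA table above
eig <- c(2.8203185, 2.4384453, 0.9473284, 0.9318970, 0.6646716,
         0.4633475, 0.3579932, 0.2111140, 0.1648844)

pct_var <- eig / sum(eig) * 100   # percentage of variance per component
cum_var <- cumsum(pct_var)        # cumulative percentage of variance

# With standardised variables the eigenvalues sum to the number of variables
round(sum(eig), 2)

# Share of variability explained by the first two components
round(cum_var[2], 2)
```

The eigenvalues sum to 9, which matches the nine Spotify features used in the PCA, and the first two components reach the 58.43% reported in the table.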

Circle of correlation

After deciding how many principal components to retain, we can see how much each variable contributes to each new dimension.

# Circle of correlation
circle <- fviz_pca_var(PCA_res,col.var = "contrib",axes = c(1,2),
             gradient.cols = c("#E6D1FE","#B067FF","#2E0160"), title="Variables - PCA", legend.title = "Contribution") + 
  theme_minimal()
ggplotly(circle)

The variables are indicated by arrows drawn from the origin and their projected values on each PC show how much weight they have on that PC.

We can see that danceability, valence, speechiness and instrumentalness strongly influence the first principal component. If a set of variables is highly correlated with a principal component, this principal component is reliable in describing that set of variables. So, PC1 could also be seen as a measure of the rhythm of a song.

Instead, PC2 is strongly influenced by loudness, energy and acousticness. This means that PC2 could be a good indicator of sound intensity.

Moreover, the angles between the vectors tell us how variables correlate with one another, confirming what we saw in the correlation matrix:

When two vectors are close, forming a small angle, the two variables they represent are positively correlated. Example: loudness and energy.
If they meet each other at 90°, they are not likely to be correlated. Example: loudness and valence (they are almost perpendicular).
When they diverge and form a large angle (close to 180°), they are negatively correlated. Example: instrumentalness and valence.
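
The link between angles and correlations can be checked numerically. The sketch below uses a built-in data set (mtcars, not the Taylor Swift data) and base R's `prcomp()`; it shows that, when all components are kept, the cosine of the angle between two variable vectors equals their correlation. In the two-dimensional circle only the first two coordinates are drawn, so there the relationship holds approximately.

```r
# Toy illustration on a built-in data set (not the Taylor Swift data)
X <- scale(mtcars[, c("mpg", "disp", "hp", "wt")])
pca <- prcomp(X)

# Coordinates of the variables on all PCs: loadings scaled by the
# standard deviation of each component (variable-PC correlations)
var_coord <- pca$rotation %*% diag(pca$sdev)

# Every variable vector has length 1 in the full space, so the dot
# product of two vectors is the cosine of the angle between them
cosines <- var_coord %*% t(var_coord)

# That cosine equals the correlation between the two variables
all.equal(unname(cosines), unname(cor(X)))
```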

What about the relationship between the songs and the new components? We can show their positions with respect to these new dimensions of reference.

Plot of individuals

# Plot of individuals
# label = "none" hides the song labels; colors are set by habillage (the genre)
fviz_pca_ind(PCA_res, label = "none", habillage = as.factor(taylor$genre1),
             palette = "Dark2", addEllipses = TRUE, ellipse.level = 0.95,
             legend.title = "Genre") +
  theme_minimal()

The plot shows the projection of the data onto the span of the principal components. We can use this plot to assess the data structure and detect clusters, outliers, and trends. Groupings of data on the plot may indicate two or more separate distributions in the data.

Songs that have values close to the average appear near the origin of the plot. Points that lie far from the rest of the points may be outliers.

Moreover, songs that have similar values for the same variables are now clustered together while songs that are far from each other have different values for these variables.

Biplot

We can also represent both the variables and the songs on the following biplot, where the arrows represent the original variables and the points the songs.

# Biplot
fviz_pca_biplot(PCA_res,
                col.var = "black",
                label = "var",
                habillage = taylor$genre1,
                palette = "Dark2",
                legend.title = "Genre")+
  theme_minimal()

Points lying close to the x-axis have PC2 scores near zero, so their position is driven mainly by the variables related to PC1. Conversely, points lying close to the y-axis have PC1 scores near zero, so their position is driven mainly by the variables related to PC2. In both cases, a song tends to have high values of the variables whose arrows point in its direction.

Generally speaking, on average we can say that:

pop songs are characterized by high values of danceability and valence while having lower values for the acousticness;
country songs are characterized by high values of energy and loudness while having lower values of acousticness, just like pop songs;
alternative songs, instead, are mostly acoustic and not danceable or energetic.

Text Analysis

Now let’s shift our attention to the lyrics of the songs. Taylor Swift is known less for an impressive voice than for her brilliant songwriting and storytelling, to which people can relate. In 2010 she was named country songwriter of the year at the BMI Awards, becoming, at age 20, the youngest person to win the award. In the same year she received the prestigious Hal David Starlight Award at the Songwriters Hall of Fame Awards, joining legends like Alicia Keys, John Mayer and John Legend. Most recently, she received the National Music Publishers’ Association’s Songwriter Icon Award in 2021.

How many lyrics does Taylor Swift write?

During the 1989 World Tour, Swift thanked her fans for singing along with her and for knowing all the lyrics of her songs. She knew that it took an enormous effort because she writes a lot of lyrics.

# Remove alternative versions of songs
new_taylor <- slice(taylor, -c(15,31,71,72,73))

# Corpus of all the lyrics
library(quanteda) # corpus(), docvars()
corpus_taylor_lyrics <- corpus(new_taylor$lyrics, docnames = new_taylor$track_name)

# Add album names and years to the corpus
docvars(corpus_taylor_lyrics, "Year") <- year(ymd(new_taylor$album_release)) # year of each album
docvars(corpus_taylor_lyrics, "Album") <- new_taylor$album_name # album names

# Summary of the corpus 
summary_corpus <- summary(corpus_taylor_lyrics,
                    n = 200,
                    what = "word",
                    tolower = TRUE, # convert texts to lower case before counting types
                    remove_punct = TRUE, # remove punctuation
                    remove_separators = TRUE, # remove separators
                    remove_symbols = TRUE, # remove symbols
                    split_hyphens = TRUE ) # splits words connected by -
summary_corpus <- summary_corpus %>%
  rename("Song" = Text, "Lines" = Sentences) # rename variables

# Reorder the columns
summary_corpus<-summary_corpus[, c("Song","Album","Year","Types","Tokens","Lines")]

DT::datatable(summary_corpus)

# Distribution of word count
ggplot(summary_corpus, aes(x=Tokens)) + # the density has one variable
  geom_density(color="black", fill="#1AC03B") +
  geom_vline(xintercept = mean(summary_corpus$Tokens), linetype = "dashed") + # mean number of tokens per song
  labs(title = "Distribution of tokens", subtitle = "in Taylor Swift songs") + 
  xlab("Number of words") +
  ylab("Density") +
  theme(
    plot.background = element_rect(fill = "#fbf9f4", color = "#fbf9f4"),
    panel.background = element_rect(fill = "#fbf9f4", color = "#fbf9f4"),
    legend.key = element_rect(fill = NA, color = NA),
    legend.background = element_blank()
  )

The table and the plot show that her songs do contain a lot of lyrics. Every song has more than 150 words and some exceed 500, although most contain between 300 and 450 words. As a result, the average number of words per song is around 375, which is far from low.

The lyrics are long in general, but let’s see whether there have been any changes over time.

# Taylor Swift
taylorswift_lyrics <- new_taylor %>%
  filter(album_name=="Taylor Swift") 

taylorswift_lyrics <-  paste(unlist(taylorswift_lyrics$lyrics), collapse ="     \\\\.........\\\\     ")

corpus_taylorswift_lyrics <- corpus(taylorswift_lyrics)
docvars(corpus_taylorswift_lyrics, "Year") <- 2006
docvars(corpus_taylorswift_lyrics, "n_of_songs") <- 15

summary_corpus_taylorswift_lyrics <- summary(corpus_taylorswift_lyrics,
                    n = 200,
                    what = "word",
                    tolower = TRUE, # convert texts to lower case before counting types
                    remove_punct = TRUE, # remove punctuation
                    remove_separators = TRUE, # remove separators
                    remove_symbols = TRUE, # remove symbols
                    split_hyphens = TRUE ) # splits words connected by -)

# Fearless
fearless_lyrics <- new_taylor %>%
  filter(album_name=="Fearless") 

fearless_lyrics <- paste(unlist(fearless_lyrics$lyrics), collapse ="     \\\\.........\\\\     ")

corpus_fearless_lyrics <- corpus(fearless_lyrics)
docvars(corpus_fearless_lyrics, "Year") <- 2008
docvars(corpus_fearless_lyrics, "n_of_songs") <- 19

summary_corpus_fearless_lyrics <- summary(corpus_fearless_lyrics,
                    n = 200,
                    what = "word",
                    tolower = TRUE, # convert texts to lower case before counting types
                    remove_punct = TRUE, # remove punctuation
                    remove_separators = TRUE, # remove separators
                    remove_symbols = TRUE, # remove symbols
                    split_hyphens = TRUE ) # splits words connected by -)

# Speak Now
speaknow_lyrics <- new_taylor %>%
  filter(album_name=="Speak Now") 

speaknow_lyrics <- paste(unlist(speaknow_lyrics$lyrics), collapse ="     \\\\.........\\\\     ")

corpus_speaknow_lyrics <- corpus(speaknow_lyrics)
docvars(corpus_speaknow_lyrics, "Year") <- 2010
docvars(corpus_speaknow_lyrics, "n_of_songs") <- 17

summary_corpus_speaknow_lyrics <- summary(corpus_speaknow_lyrics,
                    n = 200,
                    what = "word",
                    tolower = TRUE, # convert texts to lower case before counting types
                    remove_punct = TRUE, # remove punctuation
                    remove_separators = TRUE, # remove separators
                    remove_symbols = TRUE, # remove symbols
                    split_hyphens = TRUE ) # splits words connected by -)

# Red
red_lyrics <- new_taylor %>%
  filter(album_name=="Red") 

red_lyrics <- paste(unlist(red_lyrics$lyrics), collapse ="     \\\\.........\\\\     ")

corpus_red_lyrics <- corpus(red_lyrics)
docvars(corpus_red_lyrics, "Year") <- 2012
docvars(corpus_red_lyrics, "n_of_songs") <- 22

summary_corpus_red_lyrics <- summary(corpus_red_lyrics,
                    n = 200,
                    what = "word",
                    tolower = TRUE, # convert texts to lower case before counting types
                    remove_punct = TRUE, # remove punctuation
                    remove_separators = TRUE, # remove separators
                    remove_symbols = TRUE, # remove symbols
                    split_hyphens = TRUE ) # splits words connected by -)

# 1989
ninteen89_lyrics <- new_taylor %>%
  filter(album_name=="1989") 

ninteen89_lyrics <- paste(unlist(ninteen89_lyrics$lyrics), collapse ="     \\\\.........\\\\     ")

corpus_ninteen89_lyrics <- corpus(ninteen89_lyrics)
docvars(corpus_ninteen89_lyrics, "Year") <- 2014
docvars(corpus_ninteen89_lyrics, "n_of_songs") <- 16

summary_corpus_ninteen89_lyrics <- summary(corpus_ninteen89_lyrics,
                    n = 200,
                    what = "word",
                    tolower = TRUE, # convert texts to lower case before counting types
                    remove_punct = TRUE, # remove punctuation
                    remove_separators = TRUE, # remove separators
                    remove_symbols = TRUE, # remove symbols
                    split_hyphens = TRUE ) # splits words connected by -)

# reputation
reputation_lyrics <- new_taylor %>%
  filter(album_name=="Reputation") 

reputation_lyrics <-  paste(unlist(reputation_lyrics$lyrics), collapse ="     \\\\.........\\\\     ")

corpus_reputation_lyrics <- corpus(reputation_lyrics)
docvars(corpus_reputation_lyrics, "Year") <- 2017
docvars(corpus_reputation_lyrics, "n_of_songs") <- 15

summary_corpus_reputation_lyrics <- summary(corpus_reputation_lyrics,
                    n = 200,
                    what = "word",
                    tolower = TRUE, # convert texts to lower case before counting types
                    remove_punct = TRUE, # remove punctuation
                    remove_separators = TRUE, # remove separators
                    remove_symbols = TRUE, # remove symbols
                    split_hyphens = TRUE ) # splits words connected by -)

# Lover
lover_lyrics <- new_taylor %>%
  filter(album_name=="Lover")

lover_lyrics <-  paste(unlist(lover_lyrics$lyrics), collapse ="     \\\\.........\\\\     ")

corpus_lover_lyrics <- corpus(lover_lyrics)
docvars(corpus_lover_lyrics, "Year") <- 2019
docvars(corpus_lover_lyrics, "n_of_songs") <- 18

summary_corpus_lover_lyrics <- summary(corpus_lover_lyrics,
                    n = 200,
                    what = "word",
                    tolower = TRUE, # convert texts to lower case before counting types
                    remove_punct = TRUE, # remove punctuation
                    remove_separators = TRUE, # remove separators
                    remove_symbols = TRUE, # remove symbols
                    split_hyphens = TRUE ) # splits words connected by -)

# Folklore

folklore_lyrics <- new_taylor %>%
  filter(album_name=="Folklore")

folklore_lyrics <- paste(unlist(folklore_lyrics$lyrics), collapse ="     \\\\.........\\\\     ")

corpus_folklore_lyrics <- corpus(folklore_lyrics)
docvars(corpus_folklore_lyrics, "Year") <- 2020
docvars(corpus_folklore_lyrics, "n_of_songs") <- 17

summary_corpus_folklore_lyrics <- summary(corpus_folklore_lyrics,
                    n = 200,
                    what = "word",
                    tolower = TRUE, # convert texts to lower case before counting types
                    remove_punct = TRUE, # remove punctuation
                    remove_separators = TRUE, # remove separators
                    remove_symbols = TRUE, # remove symbols
                    split_hyphens = TRUE ) # splits words connected by -)

# evermore
evermore_lyrics <- new_taylor %>%
  filter(album_name=="Evermore")

evermore_lyrics <- paste(unlist(evermore_lyrics$lyrics), collapse ="     \\\\.........\\\\     ")

corpus_evermore_lyrics <- corpus(evermore_lyrics)
docvars(corpus_evermore_lyrics, "Year") <- 2020
docvars(corpus_evermore_lyrics, "n_of_songs") <- 17

summary_corpus_evermore_lyrics <- summary(corpus_evermore_lyrics,
                    n = 200,
                    what = "word",
                    tolower = TRUE, # convert texts to lower case before counting types
                    remove_punct = TRUE, # remove punctuation
                    remove_separators = TRUE, # remove separators
                    remove_symbols = TRUE, # remove symbols
                    split_hyphens = TRUE ) # splits words connected by -)

# folklore & evermore
folklore_evermore_lyrics <- new_taylor %>%
  filter(album_name=="Folklore" | album_name=="Evermore") 

folklore_evermore_lyrics <- paste(unlist(folklore_evermore_lyrics$lyrics), collapse ="     \\\\.........\\\\     ")

corpus_folklore_evermore_lyrics <- corpus(folklore_evermore_lyrics)

docvars(corpus_folklore_evermore_lyrics, "Year") <- 2020
docvars(corpus_folklore_evermore_lyrics, "n_of_songs") <- 34

summary_corpus_folklore_evermore_lyrics <- summary(corpus_folklore_evermore_lyrics,
                    n = 200,
                    what = "word",
                    tolower = TRUE, # convert texts to lower case before counting types
                    remove_punct = TRUE, # remove punctuation
                    remove_separators = TRUE, # remove separators
                    remove_symbols = TRUE, # remove symbols
                    split_hyphens = TRUE ) # splits words connected by -)

# Combine all the summaries
summary_corpus_byalbum <- rbind(summary_corpus_taylorswift_lyrics, summary_corpus_fearless_lyrics, summary_corpus_speaknow_lyrics, summary_corpus_red_lyrics, summary_corpus_ninteen89_lyrics, summary_corpus_reputation_lyrics, summary_corpus_lover_lyrics, summary_corpus_folklore_evermore_lyrics)

Album <- c("Taylor Swift", "Fearless", "Speak Now", "Red", "1989", "Reputation", "Lover", "Folklore & Evermore")

summary_corpus_byalbum <- cbind(Album, summary_corpus_byalbum)
summary_corpus_byalbum <- summary_corpus_byalbum %>%
  select(Album, n_of_songs, Types, Tokens, Year)
summary_corpus_byalbum <- as.data.frame(summary_corpus_byalbum)
summary_corpus_byalbum$Year <- as.factor(summary_corpus_byalbum$Year)

line_plot1 <- summary_corpus_byalbum %>% 
  group_by(Year) %>% 
  summarize(avg_number_words_per_song=sum(Tokens)/sum(n_of_songs)) %>%
  ggplot( aes(x=Year, y=avg_number_words_per_song)) +
  geom_line(aes(group=1), color="#1AC03B",linetype="dashed", size=1)+
  geom_point(shape=21,color="black", fill="#12FF41", size=2)+
  labs(title = "Average number of words per song",
       subtitle = "from 2006 to 2020",
       x="Year",
       y="Average number of words") +
  theme(
    plot.background = element_rect(fill = "#fbf9f4", color = "#fbf9f4"),
    panel.background = element_rect(fill = "#fbf9f4", color = "#fbf9f4"),
    legend.key = element_rect(fill = NA, color = NA),
    legend.background = element_blank()
  )
ggplotly(line_plot1) %>%
  layout(title = list(text = paste0('Average number of words per song',
                                    '<br>',
                                    '<sup>',
                                    'from 2006 to 2020',
                                    '</sup>')))

The plot shows the average number of words per song written in a given year. When Swift started her career back in 2006 with her self-titled album, she wrote, on average, 267 words per song. This average increased until 2017 (with a small decrease in 2012 for the album Red), but with the last three albums, Lover (2019), Folklore (2020) and Evermore (2020), it has decreased more noticeably, probably because she shifted to the alternative genre, which focuses more on melody than on using a lot of words.
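
The per-album blocks above all repeat the same steps: filter one album, join its lyrics, count the words and record the number of songs. As a rough sketch of how those steps could be collapsed into one grouped computation, the example below uses a toy data frame and a simple base-R word counter. The data, the helper name `count_words` and the tokenising rule are all assumptions made for illustration, and the counter only approximates quanteda's tokeniser.

```r
# Toy data standing in for new_taylor (made-up text, not real lyrics)
songs <- data.frame(
  album_name = c("A", "A", "B"),
  lyrics     = c("one two three", "four five", "six seven eight nine"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rough word count of a text (lower-case, split on
# anything that is not a letter, digit or apostrophe)
count_words <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z0-9']+"))
  sum(nzchar(words))
}

# One row per album: number of songs, total words, average words per song
by_album <- do.call(rbind, lapply(split(songs, songs$album_name), function(d) {
  data.frame(Album      = d$album_name[1],
             n_of_songs = nrow(d),
             Tokens     = count_words(paste(d$lyrics, collapse = " ")),
             stringsAsFactors = FALSE)
}))
by_album$avg_words_per_song <- by_album$Tokens / by_album$n_of_songs
by_album
```

Deriving n_of_songs from the rows of the data, rather than typing it by hand, keeps the count in step with the songs that are actually present after alternative versions have been removed.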

Lexical diversity

However, to better understand the situation, we should also look at the type/token ratio (TTR) over time, that is, the number of distinct words (types) divided by the total number of words (tokens), which can be read as a measure of lexical diversity.

line_plot2 <- summary_corpus_byalbum %>% 
  group_by(Year) %>% 
  summarize(avg_different_number_words_per_song=(sum(Types)/sum(Tokens))*100) %>%
  ggplot( aes(x=Year, y=avg_different_number_words_per_song)) +
  geom_line(aes(group=1), color="#1AC03B",linetype="dashed", size=1)+
  geom_point(shape=21,color="black", fill="#12FF41", size=2)+
  labs(title = "Type/Token ratio",
       subtitle = "from 2006 to 2020",
       x="Year",
       y="TTR") +
  theme(
    plot.background = element_rect(fill = "#fbf9f4", color = "#fbf9f4"),
    panel.background = element_rect(fill = "#fbf9f4", color = "#fbf9f4"),
    legend.key = element_rect(fill = NA, color = NA),
    legend.background = element_blank())
ggplotly(line_plot2) %>%
  layout(title = list(text = paste0('Type/Token ratio',
                                    '<br>',
                                    '<sup>',
                                    'from 2006 to 2020',
                                    '</sup>')))

The highest value of the type/token ratio is reached in 2006, when Swift started her career. This means that her first album has the lowest average number of words per song, but the words are at least more varied. There was then a period (2012-2017) in which the ratio was very low; this corresponds to her pop period, whose songs have catchy but repetitive choruses. In the last years the ratio has been moving back towards its initial level and could increase further in the future.
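
To see how repetition lowers the type/token ratio, consider the minimal sketch below. The helper `type_token_ratio` and the two example texts are made up for illustration, and the tokeniser is a simple base-R split rather than quanteda's.

```r
# Hypothetical helper: type/token ratio (in %) of a text, using a simple
# base-R tokeniser rather than quanteda's
type_token_ratio <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z0-9']+"))
  words <- words[nzchar(words)]
  100 * length(unique(words)) / length(words)
}

type_token_ratio("stay stay stay")              # one type, three tokens
type_token_ratio("a different word each time")  # five types, five tokens
```

The first text repeats a single word, so its ratio is one third (about 33.3%); the second never repeats a word, so its ratio is 100%.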

The most used words

In a song, many words are repeated to make it catchier, so let’s see which words are used most, also to understand the themes of Taylor’s songs.

# My stop words dictionary
word <- c("oh","ooh","eh","ha","mmm","mm", "yeah","ah","hey","eeh","uuh","uh","la","da","di","ra","huh","hu","whoa","gonna","wanna","gotta","em")
lexicon <- c(rep("mine",length(word)))
mystopwords <- cbind(word,lexicon)
mystopwords <- rbind(stop_words,mystopwords)
  
# Data preparation
lyrics_token<-taylor %>% 
  select(lyrics) %>% 
  unnest_tokens(input = lyrics , output = "word") %>% 
  count(word,sort=TRUE) %>% 
  anti_join(mystopwords,by="word") %>%
  filter(n>10)

# Add Taylor Swift Handwriting as font
library(showtext)
#font_add("TaylorSwiftHandwriting", regular = "C:/Users/Salvatore Mancino/AppData/Local/Microsoft/Windows/Fonts/Taylor Swift Handwriting.ttf")

# Word cloud
showtext_auto()
wordcloud2(lyrics_token,
           shape = "star",
           color = "white",
           backgroundColor = "#1AC03B", 
           size=1
           #fontFamily = "TaylorSwiftHandwriting"
           )

As expected from a songwriter who writes about her personal life and experiences, the most common theme is love, seen from every point of view. Indeed, the most frequent words are “love”, “baby” and “forever”. The color “red”, which is linked to passion, love and desire, also stands out. Another important theme is abandonment, as “leave”, “stay” and “lost” are also frequent.

Network of words

Now, let’s see how the words themselves are connected.

# Couple of words 
taylor_bigrams <- taylor %>%
  unnest_tokens(bigram, lyrics, token = "ngrams", n = 2)
taylor_separated <- taylor_bigrams %>%  
  separate(bigram, into = c("word1", "word2"), sep = " ")

# Couple of words without stopwords
taylor_united <- taylor_separated %>%
  filter(!word1 %in% mystopwords$word,
         !word2 %in% mystopwords$word) %>%
  unite(bigram, c(word1, word2), sep = " ")

library(tidygraph)
library(ggraph)

# Number of times a couple of words is repeated
bigram_counts <- taylor_separated %>% 
  filter(!word1 %in% mystopwords$word,
         !word2 %in% mystopwords$word) %>% 
  count(word1, word2, sort = TRUE) 
bigram_graph <- bigram_counts %>% 
  filter(n > 6) %>%
  as_tbl_graph()

# Plot
ggraph(bigram_graph, layout = "fr") + 
  geom_edge_link(colour = "#1AC03B") +
  geom_edge_loop(colour = "#1AC03B") +
  geom_node_point(size = 2, colour = "#1AC03B") + 
  geom_node_text(aes(label = name), size = 4 ,vjust = 1, hjust = 1) +
  labs(title = "Most used words", subtitle = "in couple")+
  theme(panel.background = element_rect(fill = "white", colour = NA),
        plot.title = element_text(
        size = 18),
        plot.subtitle = element_text(size = 12))

From the bigram network above, we can clearly see the most used pairs of words. First of all, many of these pairs are song titles, like “Wildest Dreams” and “Getaway Car”. This is not surprising, because the title of a song is usually repeated several times in it. But there are also many pairs that are not titles. For instance, the word “bad” is usually followed by “blood”, “girl” or “feeling”, and the only song title among these is “Bad Blood”. As for the words that are not connected to any other word, they are connected to themselves (shown as loops), meaning that they are repeated many times in a row. For example, the lyrics of “Shake It Off” include the line “Baby, I’m just gonna shake, shake, shake, shake, shake”: the word “shake” is repeatedly followed by itself.
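The self-loops can be reproduced on a single line of lyrics. The following is only a sketch in base R (the analysis above uses unnest_tokens instead); the variable names are chosen for illustration, and no stop words are removed here.

```r
# Toy illustration of bigram counting on the "Shake It Off" line quoted above
line   <- "baby i'm just gonna shake shake shake shake shake"
tokens <- unlist(strsplit(line, " ", fixed = TRUE))

# Pair every word with the word that follows it
bigrams <- data.frame(
  word1 = head(tokens, -1),
  word2 = tail(tokens, -1)
)

# A self-loop is a bigram whose two words are identical
self_loops <- sum(bigrams$word1 == bigrams$word2)
```

Five consecutive occurrences of “shake” produce four “shake shake” bigrams, which is why the word appears in the graph as a node linked to itself.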

Sentiment Analysis

Music is often quite emotionally gripping. By turns, it can make us feel sad or elated. It can convey a sense of unfulfilled longing, of awe and wonder. It can make us laugh or cry. Music may even convey anger or regret. All these feelings are evoked not only by the rhythm of the song, but also by its lyrics.

The most common emotions

It is important to understand which emotions a singer expresses most, because, depending on how we want to feel, we may choose to listen to them or not.

So, let’s see which are the most present emotions in Taylor’s songs.

# Get the number of emotions in each text of the corpus of all the lyrics
emotion_nrc<-get_nrc_sentiment(corpus_taylor_lyrics)

# Get the total number of words that express an emotion
n_of_words <- as.numeric(colSums(emotion_nrc))
emotion <- c("anger", "anticipation", "disgust", "fear", "joy", "sadness", "surprise", "trust", "negative", "positive")
emotion_nrc <- as.data.frame(cbind(emotion, n_of_words))
emotion_nrc$emotion <- as.factor(emotion_nrc$emotion)
emotion_nrc$n_of_words <- as.numeric(emotion_nrc$n_of_words)
emotion_nrc <- emotion_nrc %>%
  filter(emotion != "positive", emotion != "negative")

library(waffle)

#Let's define the rows and the columns of the waffles
rows=48
cols=48
TOT=rows*cols #total number of squares
xlab = paste0("1 square => ", round(100/(rows*cols),3),"%")

# Data preparation
waffle_sentiments <- emotion_nrc %>% 
  mutate(n_of_words=round(TOT*n_of_words/sum(n_of_words))) %>% # this is important because counts must be summed to TOT 
  pivot_wider(names_from = emotion,values_from = n_of_words)

# Plot
waffle(waffle_sentiments,
       rows=rows,
       xlab =xlab,
       flip = T,
       size=0.5)+
  labs(title="Emotions", 
       subtitle = "expressed by Taylor in her songs")+
  scale_fill_manual("Emotion",values=c("#A6CEE3", "#1F78B4", "#B2DF8A", "#33A02C","#E31A1C", "#FB9A99", "#FDBF6F", "#FF7F00"))

Taylor’s songs convey joy, sadness and anticipation (a feeling of excitement about something that is going to happen in the near future) more than disgust and anger. From someone whose favorite theme is love, we could not expect different results. Fear is another common feeling because, as Taylor wrote in her song “I Know Places”, “love’s a fragile little flame, it could burn out”.
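A note on the waffle preparation: rounding each share independently does not always give counts that sum exactly to the total number of squares. One possible remedy, sketched below in base R, is largest-remainder rounding. The function `round_to_total()` and the counts are hypothetical and are not the NRC output used above.

```r
# Largest-remainder rounding: integer counts that always sum to `total`.
# `counts` below is made-up illustration data, not the NRC output.
round_to_total <- function(counts, total) {
  exact    <- total * counts / sum(counts)  # ideal (fractional) number of squares
  squares  <- floor(exact)                  # start from the rounded-down values
  leftover <- total - sum(squares)          # squares still to hand out
  if (leftover > 0) {
    # Give one extra square to the categories with the largest fractional part
    extra <- order(exact - squares, decreasing = TRUE)[seq_len(leftover)]
    squares[extra] <- squares[extra] + 1
  }
  squares
}

counts  <- c(anger = 1, joy = 1, sadness = 1)
squares <- round_to_total(counts, total = 10)

# Plain rounding, for comparison
naive <- round(10 * counts / sum(counts))
```

With three equal categories and 10 squares, plain rounding assigns 3 squares to each and leaves one square unassigned, while the sketch hands the remaining square to one of the categories.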

The mood of the words

But what if we consider more general feelings, like positive and negative?

library(wordcloud)
library(reshape2)

# Get sentiments of the words from bing dictionary
bing_sentiments <- get_sentiments("bing")

# Merge to take taylor's words which express a sentiment
taylor_sentiment<-merge(bing_sentiments,lyrics_token,by="word") 

#Plot 
positive_negative_words <- taylor_sentiment %>%
  group_by(sentiment) %>% 
  summarise(n=sum(n)) %>% 
  ggplot(aes(x=sentiment, y=n))+
  geom_bar(stat = "identity", fill="#C81717", color="#0c0c0c")+
  labs(title = "Positive and negative words",
       x="Sentiment",
       y="Number of words")+
  theme(plot.background = element_rect(fill = "#fbf9f4", color = "#fbf9f4"),
        panel.background = element_rect(fill = "#fbf9f4", color = "#fbf9f4"))
ggplotly(positive_negative_words)

Among the words that express a feeling, 1154 are negative and 951 are positive. So, although the previous plot showed joy as the most frequent emotion, overall the negative feelings prevail. This is probably due to the high number of words expressing “anticipation”, which can correspond to either positive or negative feelings, unlike “joy” or “sadness”, which correspond only to positivity or negativity. Anticipation, indeed, is an emotion involving pleasure or anxiety in considering or awaiting an expected event.

The most used positive and negative words

Let’s see which are the most popular words that express positivity and negativity.

# Word cloud
taylor_sentiment %>% 
  acast(word ~ sentiment, value.var="n",fill= 0) %>% 
  comparison.cloud(colors=c("#3E4EE5","#C81717"),
                   random.order = F, 
                   title.colors = "white",
                   title.bg.colors = c("#3E4EE5","#C81717"))

Of course, among the positive words we find “love”, which is also the most used word overall. But there are also words like “beautiful”, “shine”, “darling” and “smile”. Among the words that express negativity, instead, we find “bad”, “shake”, “lost” and “break”.

Albums and feelings

Lastly, it may be interesting to see which albums convey more positivity and which convey more negativity. That way, when you want to listen to Taylor Swift and feel, for example, happy or sad (sometimes listening to sad songs helps to improve your mood), you know what to listen to.

# Total number of words in each album
tot<-taylor %>%
  select(album_name, album_release, lyrics) %>% 
  unnest_tokens(input = lyrics , output = "word") %>% 
  inner_join(bing_sentiments, by="word") %>% 
  group_by(album_name, album_release) %>% 
  summarise(tot_count=n()) 
tot$album_release <- as.Date(tot$album_release,"%Y-%m-%d")
tot <- as.data.frame(tot)
tot <- tot %>% 
  arrange(ymd(tot$album_release))

# Custom palette of colors
my_palette2= c("Taylor Swift" = "#000000" ,
             "Fearless" = "#000000", 
            "Speak Now" = "#000000",
            "Red" = "#000000",
            "1989" = "#000000",
            "Reputation" = "#000000",
            "Lover" = "#000000",
            "Folklore" = "#000000",
            "Evermore" = "#000000",
            "positive" = "#C81717",
            "negative" = "#3E4EE5")

# Number of positive and negative words in each album
chord_taylor<-taylor %>%
  select(album_name, album_release, lyrics) %>% 
  unnest_tokens(input = lyrics , output = "word") %>%
  inner_join(bing_sentiments, by="word") %>% 
  group_by(album_name,album_release,sentiment) %>% 
  summarise(count=n(),.groups="drop") 
chord_taylor$album_release <- as.Date(chord_taylor$album_release,"%Y-%m-%d")
chord_taylor <- as.data.frame(chord_taylor)
chord_taylor <- chord_taylor %>% 
arrange(ymd(chord_taylor$album_release))
tot_count<-rep(tot$tot_count, each=2)

# Percentage of the positive and negative number of words
chord_taylor<-cbind(chord_taylor,tot_count) %>% 
mutate(percent_count=count/tot_count) %>% 
select(album_name,sentiment,percent_count) 

library(chorddiag)
library(circlize)

# Plot
circos.clear()
circos.par(start.degree = 90, gap.degree = 2, track.margin = c(-0.1, 0.1), points.overflow.warning = FALSE)
par(mar =c(5.1, 4.1, 4.1, 2.1))

chordDiagram(
  x = chord_taylor, 
  grid.col = my_palette2,
  col = c("#3E4EE5","#C81717") ,
  transparency = 0.1,
  directional = 1,
  direction.type = c("diffHeight"), 
  diffHeight  = -0.04,
  annotationTrack = c("grid","name"), 
  annotationTrackHeight = c(0.05, 0.1),
  scale = TRUE,
  link.lwd = 1,    
  link.lty = 1,    
  link.border = 1,
  link.target.prop = TRUE)
  title(main="The mood of each album", col.main="#C81717", outer=F)

In general, within each album the shares of positive and negative words are roughly equal. However, two albums stand out: “Lover” and “Folklore”. The former is clearly dominated by positive feelings, while the latter is the only album to express more negativity than positivity.
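The shares plotted in the chord diagram are obtained by repeating each album total twice, which assumes that every album has exactly one positive and one negative row, in the same order as the totals. As a sketch of an alternative that does not rely on row order, the share can be computed within each album with base R's ave(). The data frame below is made up for illustration.

```r
# Made-up counts for two hypothetical albums; "B" has no negative words at all
chord <- data.frame(
  album_name = c("A", "A", "B"),
  sentiment  = c("negative", "positive", "positive"),
  count      = c(30, 70, 50)
)

# ave() returns, for each row, the total count of that row's album
album_total <- ave(chord$count, chord$album_name, FUN = sum)
chord$percent_count <- chord$count / album_total
```

Here album “B” correctly gets a positive share of 1 even though it has a single row, a case in which the positional approach would misalign the totals.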

Album sales

In the last section of this analysis, we focus on the sales of Swift’s albums. Swift is known for being one of the best-selling musicians of all time, and her new releases regularly top the charts. In 2014 a track titled “Track 3” was released by mistake, and it topped the Canadian iTunes chart within 10 minutes. Nothing strange... except that the track actually consisted of 8 seconds of static noise!

Total sales

Let’s see how much each album has sold all over the world. The sales include both pure copies and copies derived from streams. In particular, 1500 streams correspond to a pure sale.
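As a small worked example of this conversion, the sketch below applies the 1500:1 rate to a few stream counts. The function name `streams_to_units()` and the numbers are made up for illustration.

```r
# Convert streams to album-equivalent units at the 1500:1 rate stated above
streams_to_units <- function(streams, rate = 1500) {
  round(streams / rate)
}

# Made-up stream counts for three hypothetical tracks
streams     <- c(3000000, 1500, 749)
units       <- streams_to_units(streams)
total_units <- sum(units)
```

Three million streams count as 2000 copies, while a track with fewer than 750 streams rounds down to zero. Because the rounding is applied track by track before summing, the album total can differ slightly from rounding the summed streams once.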

library(treemap) #NEW
library(d3treeR) #NEW
 

library(dplyr)
library(readxl)

# Data preparation
albumsales <- read.csv("albumsales.csv")
puresales <- albumsales %>%
  dplyr::group_by(Album) %>%
  dplyr::summarise(Sales=sum(Sales))
Year = c("2014","2020","2008","2020","2019","2012","2017","2010","2006") # to add the year of release of the album
puresales <- cbind(puresales,Year)
puresales <- puresales[c("Album","Year","Sales")]
puresales <- as.data.frame(puresales)
puresales <- dplyr::arrange(puresales, puresales$Year, desc(puresales$Album)) # to reorder



# Convert 1500 streams in 1 pure copy
taylor <- taylor %>%
  dplyr::group_by(album_name) %>%
  mutate(Streams_equivalent_units=round(streams/1500), .after=streams) 
equivalent_album_units <- taylor %>%
  dplyr::group_by(album_name) %>%
  summarise(Streams_equivalent_album_units = sum(Streams_equivalent_units))
equivalent_album_units <- cbind(equivalent_album_units,Year)
equivalent_album_units <- equivalent_album_units[c("album_name","Year","Streams_equivalent_album_units")]
equivalent_album_units <- as.data.frame(equivalent_album_units)
equivalent_album_units <- dplyr::arrange(equivalent_album_units, equivalent_album_units$Year, desc(equivalent_album_units$album_name)) # to reorder
equivalent_album_units <- equivalent_album_units %>%
  rename(Album = album_name, Sales = Streams_equivalent_album_units)
sales <- rbind(puresales,equivalent_album_units)
sales <- sales %>%
  mutate(Type=rep(c("Pure Sales", "Streams"), each = 9))
sales <- dplyr::arrange(sales, sales$Year, sales$Type) # to reorder

# Create a variable "Album" with the corresponding release year
sales <- sales %>%
  mutate(year_parenthesis = paste("(",Year,")",sep = ""))
sales <- sales %>%
  mutate(album_year = paste(Album, year_parenthesis, sep = " "))

# Treemap
p <- treemap(sales,
             index=c("album_year","Type"),
             vSize="Sales",
             fontcolor.labels = "black",
             bg.labels = 0,
             type="index",
             palette = "Paired",
             align.labels=list(
               c("center", "top"), 
               c("right", "bottom")
             ),
              title = "Total sales of the albums",
        fontsize.title = 16, fontface.labels = "plain"
           )

1989 is Swift’s best-selling album, probably because it is her most “commercial” one. Most of its sales come from pure copies, with only a small portion coming from streams. The same holds for her first 5 albums. From Reputation (2017) onwards, the share of sales derived from streams becomes more substantial relative to pure copies. This suggests that, for her too, the physical market is shrinking.
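One fragile step in the data preparation for the treemap is that the release years are attached by position, which works only while the rows are in alphabetical order of album name. A possible alternative, sketched here in base R, is a named lookup vector. The years are the ones in the Year vector above, assuming that vector follows the alphabetical order of the albums; the small sales table is made up.

```r
# Release years keyed by album name (values taken from the Year vector above,
# assuming it is in alphabetical order of album name)
release_year <- c(
  "Taylor Swift" = "2006", "Fearless" = "2008", "Speak Now" = "2010",
  "Red" = "2012", "1989" = "2014", "Reputation" = "2017",
  "Lover" = "2019", "Folklore" = "2020", "Evermore" = "2020"
)

# Made-up sales table whose rows are deliberately not in alphabetical order
toy_sales <- data.frame(Album = c("Red", "1989", "Lover"), Sales = c(3, 5, 2))

# Look each album up by name, so row order no longer matters
toy_sales$Year <- unname(release_year[as.character(toy_sales$Album)])
```

With a lookup by name, adding or reordering albums does not silently shift the years onto the wrong rows.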

1989’s streams

As 1989 is Swift’s most sold album, let’s see if people still stream it 7 years after its release.

We consider only the streams of the last 6 weeks.

# Data preparation
album1989dailystreams <- read.csv("album1989dailystreams.csv")
album1989dailystreams$Week <- as.character(album1989dailystreams$Week)
album1989dailystreams$Date <- as.Date(album1989dailystreams$Date, "%Y-%m-%d")
# Abbreviate the day names to their first three letters (Monday -> Mon, ...)
album1989dailystreams$Day_week <- substr(album1989dailystreams$Day_week, 1, 3)
album1989dailystreams$Day_week <- as.factor(album1989dailystreams$Day_week)
album1989dailystreams$Daily_streams <- as.numeric(album1989dailystreams$Daily_streams)
# Keep only the last 6 weeks
album1989dailystreams <- album1989dailystreams %>%
   filter(Week %in% 11:16)
# Shift the values so that the radar grid starts at 3M streams
album1989dailystreams$Daily_streams <- album1989dailystreams$Daily_streams - 3000000
new_album1989dailystreams <- album1989dailystreams %>%
  select(Week,Day_week,Daily_streams) %>% #select the right columns
  pivot_wider(values_from = "Daily_streams",names_from = "Day_week")

library(ggradar)

my_palette3 <- c("#A6CEE3", "#1F78B4", "#B2DF8A", "#33A02C","#E31A1C", "#FB9A99", "#FDBF6F", "#FF7F00")

# Plot
ggradar(new_album1989dailystreams,
        grid.min = 0, #what is the min value
        grid.max = 1000000, #what is the max value
        grid.mid = 500000,
        values.radar = c("3M", "3.5M", "4M"), #the labels of the grid
        group.point.size=2,
        legend.title = "Week",
        background.circle.colour = "white") +
  theme_void() + # complete theme first, otherwise it overrides the settings below
  theme(
    plot.background = element_rect(fill = "#fbf9f4", color = "#fbf9f4"),
    panel.background = element_rect(fill = "#fbf9f4", color = "#fbf9f4"),
    plot.title.position = "plot",
    plot.title = element_text(size = 18),
    plot.subtitle = element_text(size = 12))+ 
  labs(title = "1989's daily streams", subtitle = "from 14th march to 24th april 2022") +
  scale_color_manual(labels = c("14 mar - 20 mar", "21 mar - 27 mar", "28 mar - 3 apr", "4 apr - 10 apr", "11 apr - 17 apr", "18 apr - 24 apr"),
                     values = my_palette3)

Well, it seems that the album still gets a lot of streams after all this time. We can also see a pattern: streams tend to be lower on Saturdays and especially on Sundays, probably because people go out and have less time to listen to music. Even if they stay at home, they may choose physical media because they have more time. Streams are instead higher on workdays, possibly because people listen to music on streaming platforms on their way to work or school.
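The weekend pattern read off the radar chart could also be checked numerically by comparing average streams on workdays and weekends. The sketch below uses base R and made-up numbers for one hypothetical week, not the real 1989 data.

```r
# Made-up daily streams for one hypothetical week (not the real 1989 data)
daily <- data.frame(
  Day_week      = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
  Daily_streams = c(3.8e6, 3.9e6, 3.9e6, 3.8e6, 3.7e6, 3.4e6, 3.3e6)
)

# Flag weekend days, then average the streams within each group
daily$is_weekend <- daily$Day_week %in% c("Sat", "Sun")
avg_streams <- tapply(daily$Daily_streams, daily$is_weekend, mean)

# Relative drop of the weekend average with respect to the workday average
weekend_drop <- 1 - avg_streams[["TRUE"]] / avg_streams[["FALSE"]]
```

Applied to the real table (before the 3M offset is subtracted), the same two steps would give a single figure for how much lower weekend streaming is.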