Final Project

Author

Grace Demaree

Analysis of Popular Music

Introduction

I want to find common factors in the top songs on Spotify that might explain why these songs are so popular on the ARIA and Billboard charts. For example, what degree of loudness, danceability, energy, explicitness, etc. do audiences prefer? These factors may help determine which songs rise to the top and which songs flop on Spotify. I find this interesting because I love listening to all kinds of music. From recent pop, to rap, to country, to older music from the Beatles, Spotify is my go-to platform for listening to my favorite songs. I would love to learn more about the songs that users like me enjoy most through this line of inquiry.

Dataset

I will answer these questions using a dataset from Kaggle entitled “Top 10000 Songs on Spotify 1960-Now.” This dataset suits these questions because it records a range of characteristics for each song. These properties can be used to determine what makes the songs similar to and different from each other. Access to this dataset is included below. The dataset has 9,999 rows and 35 columns; not all of the columns will be used in my analysis. The characteristics that I will use most are those that describe the song itself.

Brief Data Dictionary (of Variables I Will Use):

Attribute	Description
Track Name	The name of the song
Artist Name(s)	Name(s) of the artist(s)
Explicit	Labeled TRUE if explicit, FALSE if non-explicit
Popularity	How popular the song is on a scale from 0-100
Danceability	How danceable the song is on a scale from 0.00-1.00
Energy	How energetic the song is on a scale from 0.00-1.00
Loudness	How loud the song is (no scale given in the dataset; Spotify reports loudness in decibels)
Speechiness	How much spoken-word content the song contains on a scale from 0.00-1.00
Acousticness	How acoustic the song is on a scale from 0.00-1.00
Instrumentalness	How instrumental the song is on a scale from 0.00-1.00

Kaggle “Top 10000 Songs on Spotify 1960-Now” Dataset

library(readr)
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.1     ✔ purrr     1.0.1
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(rvest)


Attaching package: 'rvest'

The following object is masked from 'package:readr':

    guess_encoding

library(xml2)
library(httr)
library(magrittr)


Attaching package: 'magrittr'

The following object is masked from 'package:purrr':

    set_names

The following object is masked from 'package:tidyr':

    extract

library(tidytext)
songs<-read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/demareeg_xavier_edu/EakH8MtHgIhMpxZdnlgoS1wB7MZF7jH8lNtN-BkXELq5dQ?download=1")

Rows: 9999 Columns: 35
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (16): Track URI, Track Name, Artist URI(s), Artist Name(s), Album URI, ...
dbl  (16): Disc Number, Track Number, Track Duration (ms), Popularity, Dance...
lgl   (2): Explicit, Album Genres
dttm  (1): Added At

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Summary Statistics

Attribute	Minimum	Median	Mean	Maximum
Popularity	0	42	37.62	98
Danceability	0	0.6170	0.6079	0.9880
Energy	0.0000203	0.7120	0.6833	0.9970
Loudness	-29.368	-6.518	-7.269	2.769
Speechiness	0	0.0429	0.0651	0.7110
Acousticness	0.0000027	0.0956	0.2086	0.9910
Instrumentalness	0	0.0000057	0.0293	0.9850
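A table like the one above can be produced in a few lines of dplyr and tidyr. The following is a minimal sketch, not necessarily the code used for this report. It runs on a small made-up tibble, `toy_songs`, which is a hypothetical stand-in for the `songs` data.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for `songs`; these values are invented for illustration
toy_songs <- tibble::tibble(
  Popularity   = c(10, 42, 98, NA),
  Danceability = c(0.2, 0.6, 0.9, 0.5)
)

# Reshape to one row per (attribute, value), then summarise each attribute
summary_table <- toy_songs %>%
  pivot_longer(everything(), names_to = "Attribute", values_to = "value") %>%
  group_by(Attribute) %>%
  summarise(
    Minimum = min(value, na.rm = TRUE),
    Median  = median(value, na.rm = TRUE),
    Mean    = mean(value, na.rm = TRUE),
    Maximum = max(value, na.rm = TRUE)
  )

print(summary_table)
```

Setting `na.rm = TRUE` matters here because some attributes contain missing values, which would otherwise turn every statistic for that attribute into `NA`.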

Descriptive Analysis

Visualization 1: Explicit Songs

To create this visualization, I added a new variable to the data that labels each song as either “Explicit” or “Not explicit.” This is based on the Boolean value in the “Explicit” column of the original dataset. From there, I created a bar plot that shows the average popularity score for explicit versus non-explicit songs.

# Label each song using the logical Explicit column
songs$explicittext <- 
  ifelse(songs$Explicit==FALSE,"Not explicit","Explicit")
# Bar heights are the mean Popularity within each group
songs %>% 
  ggplot(aes(x=explicittext, y = Popularity)) +
  geom_bar(stat = "summary", fun = mean) +
  labs(title = "Average Popularity Score of Explicit vs. Non-Explicit Songs",
       x = "Explicitness",
       y = "Average Popularity")

The plot shows that explicit songs have an average popularity score about 7 points higher than non-explicit songs. This suggests that explicit content is associated with higher popularity, although a difference in averages alone does not show that explicitness causes a song to climb.
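To put numbers behind a bar plot like this, the group means can be computed directly and the gap checked with a two-sample test. This is a sketch under assumptions: `toy_songs` is a hypothetical six-row stand-in for `songs`, and the Welch t-test is one reasonable choice of test that I am adding for illustration, not a step from the original analysis.

```r
library(dplyr)

# Toy stand-in for `songs`; values are invented for illustration
toy_songs <- tibble::tibble(
  Explicit   = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE),
  Popularity = c(60, 70, 80, 50, 55, 75)
)

# Mean popularity and group size for each label
group_means <- toy_songs %>%
  mutate(explicittext = if_else(Explicit, "Explicit", "Not explicit")) %>%
  group_by(explicittext) %>%
  summarise(mean_popularity = mean(Popularity), n = n())

# Welch two-sample t-test (the default for t.test) on the same grouping
welch <- t.test(Popularity ~ Explicit, data = toy_songs)

print(group_means)
print(welch$p.value)
```

Reporting the group sizes alongside the means helps a reader judge how much weight the difference can bear.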

Visualization 2: Danceability

songs %>% 
  filter(Popularity>50)%>%
  ggplot(aes(x=Danceability, y = Popularity)) +
  geom_point() +
  labs(title = "Popularity Score with Respect to Danceability",
       subtitle = "For Songs with a Popularity Score over 50",
       x = "Danceability",
       y = "Popularity") +
  geom_smooth()

`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

The above plot shows that danceability has a slight positive correlation with a song’s popularity for songs with a popularity score over 50. In other words, if a song is more danceable, it is more likely to have a higher popularity score. This makes sense, as people are more likely to listen to songs that they can dance to and enjoy together.
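A correlation coefficient is one way to quantify how "slight" a trend like this is. The sketch below assumes a hypothetical `toy_songs` tibble in place of `songs` and applies the same popularity filter before computing Pearson's r. It is an illustration, not a result from the real data.

```r
library(dplyr)

# Toy stand-in for `songs`; values are invented for illustration
toy_songs <- tibble::tibble(
  Danceability = c(0.30, 0.45, 0.60, 0.75, 0.90, 0.50),
  Popularity   = c(55, 60, 62, 70, 80, 40)
)

# Keep only songs with a popularity score over 50, as in the plot
popular <- toy_songs %>%
  filter(Popularity > 50)

# Pearson correlation, ignoring rows with missing values
r <- cor(popular$Danceability, popular$Popularity, use = "complete.obs")
print(r)
```

Because the filter removes all songs at or below 50, the coefficient only describes the relationship within the more popular songs.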

Visualization 3: Energy

songs %>%
  ggplot(aes(x=Energy, y = Popularity)) +
  geom_point() +
  labs(title = "Popularity Score with Respect to Energy",
       x = "Energy",
       y = "Popularity") +
  geom_smooth()

`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).

Warning: Removed 2 rows containing missing values (`geom_point()`).

As shown above, energy does not have a large impact on a song’s popularity. One possible reason is that some of the most popular songs are contemporary, modern, or R&B, genres that do not require a lot of energy for a song to be successful.

Visualization 4: Loudness

songs %>% 
  filter(Popularity>60) %>%
  ggplot(aes(x=Loudness, y = Popularity)) +
  geom_point() +
  labs(title = "Popularity Score with Respect to Loudness",
       subtitle = "For Songs with a Popularity Score Over 60",
       x = "Loudness",
       y = "Popularity") +
  geom_smooth()

`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

It appears that loudness has a slight positive effect on a song’s popularity for songs with a popularity score over 60. Perhaps people enjoy louder songs because they are more impactful and recognizable than others.

Visualization 5: Energy vs. Danceability

songs %>% 
  filter(Popularity>60)%>%
  ggplot(aes(x=Energy, y = Danceability)) +
  geom_point() +
  labs(title = "Energy versus Danceability",
       subtitle = "Of Songs with Popularity Over 60",
       x = "Energy",
       y = "Danceability") +
  geom_smooth()

`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

This graph shows that, in general, as energy increases, so does danceability for songs with a popularity score over 60. This makes sense, since danceable songs tend to require more energy.

Secondary Source

For my secondary source, I scraped Genius.com for the lyrics of the 100 most popular songs in the Kaggle dataset. Then, I performed sentiment analysis with the NRC emotion lexicon on the lyrics of each song to find their respective anger, anticipation, disgust, fear, joy, negative, positive, sadness, surprise, and trust scores. I also included a positivity column, computed as the positive score minus the negative score.
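The scoring step described above can be sketched with tidytext. This is a minimal illustration under assumptions, not the exact pipeline used for this report: `toy_lyrics` is made-up text, and `toy_lexicon` is a tiny hypothetical stand-in for the NRC lexicon. The real lexicon is typically loaded with `get_sentiments("nrc")`, which needs a one-time download, and it can assign one word to several categories.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical lyrics, invented for illustration
toy_lyrics <- tibble::tibble(
  track  = c("Song A", "Song B"),
  lyrics = c("happy happy love tears", "alone tears hate love")
)

# Tiny stand-in lexicon; NOT the real NRC word list
toy_lexicon <- tibble::tibble(
  word      = c("happy", "love", "tears", "alone", "hate"),
  sentiment = c("joy", "positive", "sadness", "negative", "negative")
)

toy_scores <- toy_lyrics %>%
  unnest_tokens(word, lyrics) %>%            # one row per word
  inner_join(toy_lexicon, by = "word") %>%   # keep words found in the lexicon
  count(track, sentiment) %>%                # words per category per song
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(positivity = positive - negative)   # positive minus negative score

print(toy_scores)
```

Using `values_fill = 0` keeps songs with no words in a category from ending up with `NA` scores, which would otherwise propagate into the positivity column.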

A table of my findings is linked below. I will use this data to learn more about how different emotions relate to different characteristics of songs, such as danceability, energy, speechiness, etc. My findings using the NRC emotive lexicon on lyrics from Genius.com will give more insight into the original Kaggle dataset on Spotify.

Scraped Dataset

songssentiments <- read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/demareeg_xavier_edu/EWJY8tL5SI9HuimQZ6owgYIBETHhAxaKcp6_VB2gKS6G0g?download=1")

Rows: 100 Columns: 47
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (16): Track URI, Track Name, Artist URI(s), Artist Name(s), Album URI, ...
dbl  (29): Disc Number, Track Number, Track Duration (ms), Popularity, Dance...
lgl   (1): Explicit
dttm  (1): Added At

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Visualization 6: Impact of Surprise Score on Danceability

songssentiments %>% 
  ggplot(aes(x=surprise, y = Danceability)) +
  geom_point() +
  labs(title = "Impact of Surprise Score on Danceability",
       subtitle = "On the Top 100 Most Popular Songs",
       x = "Surprise",
       y = "Danceability") +
  geom_smooth()

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

As a song rates higher in the surprise aspect, its danceability typically increases.

Visualization 7: Impact of Joy on Loudness

songssentiments %>% 
  ggplot(aes(x=joy, y = Loudness)) +
  geom_point() +
  labs(title = "Impact of Joy Score on Loudness",
       subtitle = "On the Top 100 Most Popular Songs",
       x = "Joy",
       y = "Loudness") +
  geom_smooth()

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

It appears that there is a slight positive correlation between the joy score of a song and its loudness level.

Visualization 8: Impact of Anticipation on Energy

songssentiments %>% 
  ggplot(aes(x=anticipation, y = Energy)) +
  geom_point() +
  labs(title = "Impact of Anticipation Score on Energy",
       subtitle = "On the Top 100 Most Popular Songs",
       x = "Anticipation",
       y = "Energy") +
  geom_smooth()

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Overall, a song’s energy level changes very little as its anticipation score changes.

Visualization 9: Impact of Positive Score on Energy

songssentiments %>% 
  ggplot(aes(x=positive, y = Energy)) +
  geom_point() +
  labs(title = "Impact of Positive Score on Energy",
       subtitle = "On the Top 100 Most Popular Songs",
       x = "Positive Score",
       y = "Energy") +
  geom_smooth()

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

There seems to be a slight positive correlation between a song’s positive score and its energy level.

Other Findings with My Secondary Source

It is important to note that scraping Genius.com data in this way allows me to find similarities and differences between artists, and even between individual songs. I present some more of the findings that I found interesting below.

Visualization 10: Sentiments of Top Artists on the “songs” Dataset

According to the “songs” dataset, Olivia Rodrigo, Dua Lipa, Miley Cyrus, OneRepublic, and Taylor Swift all have songs among the 10 most popular. I want to find out: which of these artists has the most positive and the most negative lyrics?

The artists have similar ratios of emotions within their lyrics. Their raw counts differ because, although I scraped 20 songs for each artist, these songs differed in length and wordiness. This suggests that Taylor Swift has some of the wordiest songs, while those of OneRepublic and Dua Lipa are shorter and less wordy.

Other than that, I must note that Miley Cyrus outranks Taylor Swift (and everyone else) in joy-related lyrics even though Taylor’s songs are wordier than hers. OneRepublic scored the lowest on joy- and positive-related words, which implies that a lot of their songs are melancholic.
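Because the artists' songs differ in wordiness, one way to compare them on an equal footing is to convert raw sentiment counts into shares of each artist's matched words. The sketch below uses hypothetical artists and invented counts in `toy_counts`. It illustrates the normalization idea and is not a recomputation of the results above.

```r
library(dplyr)

# Invented sentiment word counts for two hypothetical artists
toy_counts <- tibble::tibble(
  artist    = c("Artist A", "Artist A", "Artist B", "Artist B"),
  sentiment = c("joy", "sadness", "joy", "sadness"),
  n         = c(30, 10, 6, 2)
)

# Share of each sentiment within an artist's own matched words
toy_shares <- toy_counts %>%
  group_by(artist) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

print(toy_shares)
```

In this toy example, Artist A has five times as many joy words as Artist B, yet both devote the same share of their matched words to joy. That is the kind of pattern raw counts alone can hide.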

Visualization 11: “Flowers” vs. “Vampire”

According to the “songs” dataset, “Vampire” by Olivia Rodrigo and “Flowers” by Miley Cyrus are both among the top 3 most popular songs. They are also comparable because they are around the same length and are both by solo female artists. I wonder how the sentiment of these songs compares over the course of each song.

The above chart shows us that overall, “Flowers” is a more positive song than “Vampire.” One could assume this is the case even before running the analysis, simply based on the titles of the songs. It is worth noting that “Vampire” has a noticeably larger number of negative words, which are scattered throughout the song. On the other hand, “Flowers” has a plethora of positive lyrics concentrated at four specific hot spots. It has fewer negative words and even has a couple of double positive rankings at certain points in the song.
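A chronological comparison like this can be built by scoring each line of a song in order. The following is a sketch under stated assumptions: `toy_song` holds made-up lines rather than real lyrics, and `toy_lexicon` is a hypothetical positive/negative word list with values of +1 and -1. The lexicon and line granularity used for the chart above may differ.

```r
library(dplyr)
library(tidytext)

# Made-up lines in song order (not real lyrics)
toy_song <- tibble::tibble(
  line = 1:3,
  text = c("the sun will shine and i will smile",
           "nothing much to say",
           "i cry because i am lost")
)

# Hypothetical lexicon: +1 for positive words, -1 for negative words
toy_lexicon <- tibble::tibble(
  word  = c("shine", "smile", "cry", "lost"),
  value = c(1, 1, -1, -1)
)

# Net sentiment per line; lines with no matched words score 0
toy_timeline <- toy_song %>%
  unnest_tokens(word, text) %>%
  left_join(toy_lexicon, by = "word") %>%
  group_by(line) %>%
  summarise(net = sum(value, na.rm = TRUE))

print(toy_timeline)
```

Plotting `net` against `line` for two songs side by side would show where positive and negative words cluster in each one.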

Recommendations and Conclusion

This analysis of the characteristics of the most popular songs might be helpful to an artist trying to create a hit. These findings draw on popular songs from the 1960s to now, which is noteworthy because the data spans generations of music. For artists attempting to produce a top hit, I suggest the following:

Make the song explicit, as explicit songs have a higher popularity score on average.
Try to add some degree of danceability to the song. Songs that people can dance to tend to be more popular.
Louder songs are more popular than softer songs. Use your artistic talents to make the song loud, yet tasteful.

In order to do this, the lyrics of the song must evoke certain feelings. Words with different connotations may be useful in boosting a song’s popularity. For example, I would recommend:

Using lyrics that evoke surprise. Words related to surprise are associated with higher danceability, which in turn is associated with higher popularity.
Using lyrics associated with joy. According to my findings, joyful words are associated with greater loudness, which in turn is associated with higher popularity.
Possibly using positive and anticipation-related words. These words may boost a song’s energy, which has an unclear correlation with its popularity score. This can be tested over a longer period of time for clearer results.

Overall, learning more about the most popular songs on Spotify’s platform and their characteristics, in conjunction with scraping Genius.com for their lyrics, provides me with a plethora of information on what makes a song popular. This includes intrinsic variables of a song, like its loudness, energy, and danceability, as well as sentiment analysis of its lyrics, which categorizes words as positive, negative, angry, joyful, etc. I was able to tie together the intrinsic variables of a song with the sentiment analysis of its lyrics for the top 100 most popular songs on Spotify, providing insight into their makeup. This allows a better understanding of what has made songs popular in the past, and possibly of what will continue to make them popular in the future.