I want to find common factors in the top songs on Spotify that might explain why these songs are so popular on the ARIA and Billboard charts. For example, what degree of loudness, danceability, energy, explicitness, etc. do audiences want? These are factors that help determine which songs rise to the top and which songs flop on Spotify. I find this interesting because I love listening to all kinds of music. From recent pop, to rap, to country, to older music from the Beatles, Spotify is my go-to platform for listening to my favorite songs. I would love to learn more about the songs that users like me enjoy most through this line of inquiry.
Dataset
I will answer these questions using a dataset from Kaggle entitled “Top 10000 Songs on Spotify 1960-Now.” This data suits this line of questioning because it records a range of characteristics for each song. These properties can be used to determine which aspects of the songs make them similar to and different from each other. Access to this dataset is included below. The dataset has 9,999 rows and 35 columns, not all of which will be used in my analysis. The characteristics that I will use most are those that describe the song in question.
Brief Data Dictionary (of Variables I Will Use):

| Attribute | Description |
| --- | --- |
| Track Name | The name of the song |
| Artist Name(s) | Name(s) of the singers |
| Explicit | Labeled TRUE if explicit, FALSE if non-explicit |
| Popularity | How popular the song is on a scale from 0-100 |
| Danceability | How danceable the song is on a scale from 0.00-1.00 |
| Energy | How energetic the song is on a scale from 0.00-1.00 |
| Loudness | How loud the song is (no scale given) |
| Speechiness | How verbose the song is on a scale from 0.00-1.00 |
| Acousticness | How acoustic the song is on a scale from 0.00-1.00 |
| Instrumentalness | How instrumental the song is on a scale from 0.00-1.00 |
Kaggle “Top 10000 Songs on Spotify 1960-Now” Dataset
library(readr)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.1 ✔ purrr 1.0.1
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.2 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(rvest)
Attaching package: 'rvest'
The following object is masked from 'package:readr':
guess_encoding
library(xml2)
library(httr)
library(magrittr)
Attaching package: 'magrittr'
The following object is masked from 'package:purrr':
set_names
The following object is masked from 'package:tidyr':
extract
Rows: 9999 Columns: 35
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (16): Track URI, Track Name, Artist URI(s), Artist Name(s), Album URI, ...
dbl (16): Disc Number, Track Number, Track Duration (ms), Popularity, Dance...
lgl (2): Explicit, Album Genres
dttm (1): Added At
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Summary Statistics

| Attribute | Minimum | Median | Mean | Maximum |
| --- | --- | --- | --- | --- |
| Popularity | 0.00 | 42.00 | 37.62 | 98.000 |
| Danceability | 0.00 | 0.6170 | 0.6079 | 0.9880 |
| Energy | 0.0000203 | 0.7120000 | 0.6832807 | 0.9970000 |
| Loudness | -29.368 | -6.518 | -7.269 | 2.769 |
| Speechiness | 0.00 | 0.04290 | 0.06514 | 0.71100 |
| Acousticness | 0.0000027 | 0.0956000 | 0.2085887 | 0.9910000 |
| Instrumentalness | 0.00 | 0.0000057 | 0.0293313 | 0.9850000 |
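A table like the one above can be produced with a few lines of base R. The sketch below is only an illustration: `songs_demo` is a made-up stand-in for the real data, and `four_stats` is a hypothetical helper name, not something from the original analysis.

```r
# Toy stand-in for the real data; column names follow the data dictionary.
songs_demo <- data.frame(
  Popularity   = c(0, 42, 55, 98),
  Danceability = c(0.20, 0.61, 0.70, 0.98)
)

# Minimum, median, mean, and maximum of one numeric column.
four_stats <- function(x) {
  c(Minimum = min(x, na.rm = TRUE),
    Median  = median(x, na.rm = TRUE),
    Mean    = mean(x, na.rm = TRUE),
    Maximum = max(x, na.rm = TRUE))
}

# sapply() returns one column per attribute; t() flips it so that
# each attribute is a row and each statistic is a column.
summary_table <- t(sapply(songs_demo, four_stats))
print(summary_table)
```

Swapping `songs_demo` for the real `songs` data frame (restricted to its numeric columns) should give the same layout as the table above.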
Descriptive Analysis
Visualization 1: Explicit Songs
To create this visualization, I added a new variable to the data that labels each song as either “Explicit” or “Not explicit.” This is based on the Boolean value in the “Explicit” column of the original dataset. From there, I created a bar plot that shows the average popularity score for explicit versus non-explicit songs.
# Label each song from the logical Explicit column
songs$explicittext <- ifelse(songs$Explicit == FALSE, "Not explicit", "Explicit")

# Bar height is the mean Popularity within each label
songs %>%
  ggplot(aes(x = explicittext, y = Popularity)) +
  geom_bar(stat = "summary", fun = mean) +
  labs(title = "Average Popularity Score of Explicit vs. Non-Explicit Songs",
       x = "Explicitness",
       y = "Average Popularity")
The plot shows that explicit songs have an average popularity score about 7 points higher than non-explicit songs. This suggests that explicit content may help a song climb in popularity.
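The gap between the two bars can also be computed directly instead of read off the plot. The following is a minimal sketch on invented numbers, so the 10-point gap it produces is not the real result; `songs_demo` is a hypothetical stand-in for `songs`.

```r
# Made-up rows for illustration only.
songs_demo <- data.frame(
  Explicit   = c(TRUE, TRUE, FALSE, FALSE, FALSE),
  Popularity = c(60, 50, 40, 45, 50)
)

# Same labeling idea as the plot: turn the logical column into text.
songs_demo$explicittext <- ifelse(songs_demo$Explicit, "Explicit", "Not explicit")

# Mean popularity within each label.
group_means <- tapply(songs_demo$Popularity, songs_demo$explicittext, mean)

# Difference between the two group means.
gap <- unname(group_means["Explicit"] - group_means["Not explicit"])
print(group_means)
print(gap)
```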
Visualization 2: Danceability
songs %>%
  filter(Popularity > 50) %>%
  ggplot(aes(x = Danceability, y = Popularity)) +
  geom_point() +
  labs(title = "Popularity Score with Respect to Danceability",
       subtitle = "For Songs with a Popularity Score over 50",
       x = "Danceability",
       y = "Popularity") +
  geom_smooth()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
The above plot shows that danceability has a slight positive correlation with a song’s popularity for songs with a popularity score over 50. In other words, if a song is more danceable, it is more likely to have a higher popularity score. This makes sense, as people are more likely to listen to songs that they can dance to and enjoy together.
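A correlation coefficient puts a number on how "slight" a trend like this is. The sketch below uses invented values and a hypothetical `songs_demo` data frame; it filters to popularity over 50, as the plot does, and then computes Pearson's r with `cor()`. The same pattern would apply to the energy and loudness plots.

```r
# Made-up values for illustration; the last row falls below the filter.
songs_demo <- data.frame(
  Danceability = c(0.30, 0.45, 0.60, 0.75, 0.90, 0.50),
  Popularity   = c(52, 70, 58, 66, 75, 40)
)

# Keep only songs with a popularity score over 50, as in the plot.
popular <- subset(songs_demo, Popularity > 50)

# Pearson correlation: values near 0 mean a weak linear relationship,
# values near 1 or -1 mean a strong one.
r <- cor(popular$Danceability, popular$Popularity, method = "pearson")
print(r)
```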
Visualization 3: Energy
songs %>%
  ggplot(aes(x = Energy, y = Popularity)) +
  geom_point() +
  labs(title = "Popularity Score with Respect to Energy",
       x = "Energy",
       y = "Popularity") +
  geom_smooth()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
As shown above, energy does not have a large impact on a song’s popularity. One possible reason is that some of the most popular songs are contemporary, modern, or R&B, genres that do not require a lot of energy for a song to be successful.
Visualization 4: Loudness
songs %>%
  filter(Popularity > 60) %>%
  ggplot(aes(x = Loudness, y = Popularity)) +
  geom_point() +
  labs(title = "Popularity Score with Respect to Loudness",
       subtitle = "For Songs with a Popularity Score Over 60",
       x = "Loudness",
       y = "Popularity") +
  geom_smooth()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
It appears that loudness has a slight positive effect on a song’s popularity for songs with a popularity score over 60. Perhaps people enjoy louder songs because they are more impactful and recognizable than others.
Visualization 5: Energy vs. Danceability
songs %>%
  filter(Popularity > 60) %>%
  ggplot(aes(x = Energy, y = Danceability)) +
  geom_point() +
  labs(title = "Energy versus Danceability",
       subtitle = "Of Songs with Popularity Over 60",
       x = "Energy",
       y = "Danceability") +
  geom_smooth()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
This graph shows that, for songs with a popularity score over 60, danceability generally increases as energy increases. This makes sense, since danceable songs tend to require more energy.
Secondary Source
For my secondary source, I scraped Genius.com for the lyrics of the 100 most popular songs in the Kaggle dataset. Then, I performed sentiment analysis with the NRC emotion lexicon on the lyrics of each song to find their respective anger, anticipation, disgust, fear, joy, negative, positive, sadness, surprise, and trust scores. I also include a positivity column that computes the overall positivity of the song as the positive score minus the negative score.
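The scoring step can be sketched in base R as follows. This is a simplified illustration and not the code used for the analysis: `toy_lexicon` is a hypothetical eight-row stand-in for the NRC lexicon (which is far larger), `score_lyrics` is an invented helper name, and the lyric line is made up.

```r
# Hypothetical mini-lexicon in a word/sentiment layout.
# A word may appear under more than one category.
toy_lexicon <- data.frame(
  word      = c("love", "love", "happy", "happy", "cry", "cry", "lonely", "wait"),
  sentiment = c("joy", "positive", "joy", "positive",
                "sadness", "negative", "negative", "anticipation"),
  stringsAsFactors = FALSE
)

categories <- c("anger", "anticipation", "disgust", "fear", "joy",
                "negative", "positive", "sadness", "surprise", "trust")

score_lyrics <- function(lyrics, lexicon) {
  # Lowercase, then split on anything that is not a letter or apostrophe.
  words <- strsplit(tolower(lyrics), "[^a-z']+")[[1]]
  words <- words[words != ""]

  # One row per (lyric word, matching lexicon entry) pair, so repeated
  # words are counted every time they occur.
  hits <- merge(data.frame(word = words, stringsAsFactors = FALSE),
                lexicon, by = "word")

  # Count hits per category, keeping zero counts for unused categories.
  counts <- table(factor(hits$sentiment, levels = categories))
  scores <- setNames(as.integer(counts), categories)

  # Positivity is the positive score minus the negative score.
  c(scores, positivity = scores[["positive"]] - scores[["negative"]])
}

result <- score_lyrics("I love you, I love you, but I cry and wait", toy_lexicon)
print(result)
```

Running the function once per song and binding the results row-wise would give one row of emotion scores per track, which can then be joined to the Spotify attributes.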
A table of my findings is linked below. I will use this data to learn more about how different emotions relate to different characteristics of songs, such as danceability, energy, speechiness, etc. My findings using the NRC emotive lexicon on lyrics from Genius.com will give more insight into the original Kaggle dataset on Spotify.
Rows: 100 Columns: 47
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (16): Track URI, Track Name, Artist URI(s), Artist Name(s), Album URI, ...
dbl (29): Disc Number, Track Number, Track Duration (ms), Popularity, Dance...
lgl (1): Explicit
dttm (1): Added At
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Visualization 6: Impact of Surprise Score on Danceability
songssentiments %>%
  ggplot(aes(x = surprise, y = Danceability)) +
  geom_point() +
  labs(title = "Impact of Surprise Score on Danceability",
       subtitle = "On the Top 100 Most Popular Songs",
       x = "Surprise",
       y = "Danceability") +
  geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
As a song rates higher in the surprise aspect, its danceability typically increases.
Visualization 7: Impact of Joy on Loudness
songssentiments %>%
  ggplot(aes(x = joy, y = Loudness)) +
  geom_point() +
  labs(title = "Impact of Joy Score on Loudness",
       subtitle = "On the Top 100 Most Popular Songs",
       x = "Joy",
       y = "Loudness") +
  geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
It appears that there is a slight positive correlation between the joy score of a song and its loudness level.
Visualization 8: Impact of Anticipation on Energy
songssentiments %>%
  ggplot(aes(x = anticipation, y = Energy)) +
  geom_point() +
  labs(title = "Impact of Anticipation Score on Energy",
       subtitle = "On the Top 100 Most Popular Songs",
       x = "Anticipation",
       y = "Energy") +
  geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Overall, a song’s energy level varies little with its anticipation score.
Visualization 9: Impact of Positive Score on Energy
songssentiments %>%
  ggplot(aes(x = positive, y = Energy)) +
  geom_point() +
  labs(title = "Impact of Positive Score on Energy",
       subtitle = "On the Top 100 Most Popular Songs",
       x = "Positive Score",
       y = "Energy") +
  geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
There seems to be a slight positive correlation between a song’s positive score and its energy level.
Other Findings with My Secondary Source
It is important to note that scraping Genius.com data in this way also allows me to find similarities and differences between artists, and even between individual songs. Below, I present a few more findings that I found interesting.
Visualization 10: Sentiments of Top Artists on the “songs” Dataset
According to the “songs” dataset, Olivia Rodrigo, Dua Lipa, Miley Cyrus, OneRepublic, and Taylor Swift are all artists behind the 10 most popular songs. I want to find out: which of these artists has the most positive and the most negative lyrics?
The artists have similar ratios of emotions within their lyrics. Their totals differ because, although I scraped 20 songs for each artist, the songs differ in length and wordiness. This shows us that Taylor Swift has some of the most verbose songs, while OneRepublic and Dua Lipa are less wordy and lengthy.
Beyond that, Miley Cyrus outranks Taylor Swift (and everyone else) in joy-related lyrics, even though Taylor Swift’s songs are wordier than hers. OneRepublic scored the lowest on joy- and positive-related words, which implies that many of their songs are melancholic.
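Because wordier artists accumulate more emotion words overall, comparing shares rather than raw counts makes the artists easier to compare. Here is a minimal sketch with invented counts for two placeholder artists ("Artist A" and "Artist B" are not from the dataset), using `prop.table()` to turn each artist's counts into proportions.

```r
# Invented counts: Artist A is twice as wordy as Artist B,
# but the emotional mix of their lyrics is the same.
counts <- rbind(
  "Artist A" = c(joy = 30, sadness = 10, positive = 60),
  "Artist B" = c(joy = 15, sadness = 5,  positive = 30)
)

# margin = 1 divides each row by its own total, so every row sums to 1.
shares <- prop.table(counts, margin = 1)
print(shares)
```

With raw counts, Artist A appears to lead every category; with shares, the two artists are indistinguishable, which is the distinction between wordiness and emotional mix described above.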
Visualization 11: “Flowers” vs. “Vampire”
According to the “songs” dataset, “Vampire” by Olivia Rodrigo and “Flowers” by Miley Cyrus are both among the top 3 most popular songs. They are also comparable because they are around the same length and both by solo female artists. How do the sentiments of these two songs compare as each song progresses?
The above chart shows that, overall, “Flowers” is a more positive song than “Vampire.” One could guess this even before running the analysis, simply from the titles of the songs. Notably, “Vampire” has a considerably higher number of negative words, which are scattered throughout the song. “Flowers,” on the other hand, has a plethora of positive lyrics concentrated at four specific hot spots. It has fewer negative words and even reaches a double positive score at a couple of points in the song.
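One way to follow sentiment through a song is to score it piece by piece and keep the pieces in order. The sketch below assumes one score per lyric line, which may not match the time frames used in the chart; the word lists and the three lines are invented for illustration, and `net_by_line` is a hypothetical helper.

```r
# Tiny invented word lists standing in for a real sentiment lexicon.
positive_words <- c("love", "flowers", "dance", "better")
negative_words <- c("cry", "lonely", "pain")

# Net sentiment (positive hits minus negative hits) for each line, in order.
net_by_line <- function(lines) {
  sapply(lines, function(line) {
    words <- strsplit(tolower(line), "[^a-z']+")[[1]]
    sum(words %in% positive_words) - sum(words %in% negative_words)
  }, USE.NAMES = FALSE)
}

# Made-up lines, not real lyrics.
toy_lines <- c(
  "She brought me flowers",
  "I used to cry alone",
  "Now we dance and love it better"
)

net <- net_by_line(toy_lines)
print(net)
```

Plotting `net` against the line index for each song would give a chronological view in which clusters of positive lines show up as hot spots and a score of 2 or more marks a doubly positive line.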
Recommendations and Conclusion
This analysis of the characteristics of the most popular songs might be helpful to an artist trying to create a hit. These findings draw on popular songs from the 1960s to now, which is noteworthy because they span generations of music. For artists attempting to produce a top hit, I suggest the following:
Make the song explicit, as explicit songs have a higher popularity score on average.
Try to add some degree of danceability to the song. Songs that people can dance to tend to be more popular.
Make the song loud, yet tasteful. Louder songs tend to be more popular than softer songs, so use your artistic talents accordingly.
To do this, the lyrics of the song must evoke certain feelings. Words with particular connotations can help boost a song’s popularity. For example, I would recommend:
Using lyrics that evoke surprise. Words related to surprise are associated with higher danceability, which in turn is associated with higher popularity.
Using lyrics associated with joy. According to my findings, joyful words are associated with greater loudness, which in turn is associated with higher popularity.
Possibly using positive and anticipation-related words. These words may boost a song’s energy, which has an unclear correlation with its popularity score. This could be tested over a longer period of time for clearer results.
Overall, studying the characteristics of the most popular songs on Spotify, in conjunction with scraping Genius.com for their lyrics, gave me a plethora of information on what makes a song popular. This includes intrinsic variables of a song, like its loudness, energy, and danceability, as well as sentiment analysis of its lyrics, with words categorized as positive, negative, angry, joyful, etc. I was able to tie the intrinsic variables of a song to the sentiment analysis of its lyrics for the top 100 most popular songs on Spotify, providing insight into their makeup. This gives a better understanding of what has made songs popular in the past, and possibly of what will continue to make them popular in the future.