Josephine Johnson
Donasia Washington
Spotify launched in Europe in 2008 and was released in the
United States in 2011. The digital service now houses over 100 million
songs, 5 million podcasts, and 350,000 audiobooks. Spotify is best
known for its music discovery and curated playlists. Its catalog
has songs ranging from the present day all the way back to 1904. In
this report we look at the curated playlists titled
Billboard Summer Hits, covering the years 1958 through
2017.
First, let's load the necessary packages and our dataset.
library(tidyverse)
Warning: package ‘tidyverse’ was built under R version 4.3.3
library(ggplot2)
library(ggrepel)
Warning: package ‘ggrepel’ was built under R version 4.3.2
library(stringr)
spotify_summer_hits <- "https://raw.githubusercontent.com/reisanar/datasets/master/all_billboard_summer_hits.csv"
summer_bill_hits <- read_csv(spotify_summer_hits)
Here is a quick glimpse into our data:
glimpse(summer_bill_hits)
Rows: 600
Columns: 22
$ danceability <dbl> 0.518, 0.543, 0.541, 0.408, 0.554, 0.679, 0.663, 0.684, 0.645, 0.388, 0.556, 0.703, 0.843, 0.551, 0.455, 0.588, 0.715, 0.468, 0.463, 0.568, 0.558, 0.6…
$ energy <dbl> 0.0600, 0.3320, 0.6760, 0.3970, 0.1890, 0.2790, 0.6190, 0.5560, 0.9430, 0.4340, 0.4560, 0.7530, 0.1200, 0.9340, 0.3270, 0.4490, 0.5720, 0.2810, 0.6070…
$ key <chr> "A#", "C", "C", "A", "E", "G", "F#", "B", "C", "G", "C", "A", "E", "G", "G", "F", "F#", "G#", "G#", "E", "A#", "E", "C#", "G", "D", "F", "C", "G", "E"…
$ loudness <dbl> -14.887, -11.573, -7.988, -12.536, -14.277, -10.386, -5.731, -10.602, -1.526, -11.997, -17.609, -11.783, -17.305, -6.660, -10.113, -8.782, -8.628, -13…
$ mode <chr> "major", "major", "major", "major", "major", "major", "major", "major", "major", "major", "major", "major", "major", "minor", "major", "major", "minor…
$ speechiness <dbl> 0.0441, 0.0317, 0.1350, 0.0300, 0.0279, 0.0384, 0.0334, 0.0377, 0.0393, 0.0354, 0.0295, 0.1350, 0.0788, 0.0481, 0.0328, 0.0339, 0.0527, 0.0354, 0.0299…
$ acousticness <dbl> 0.9870, 0.6690, 0.1880, 0.8730, 0.9150, 0.6450, 0.3360, 0.4680, 0.3850, 0.7890, 0.5150, 0.8030, 0.6110, 0.8220, 0.8300, 0.7230, 0.0809, 0.8320, 0.4820…
$ instrumentalness <dbl> 7.87e-06, 0.00e+00, 8.03e-01, 0.00e+00, 1.37e-05, 0.00e+00, 8.61e-06, 0.00e+00, 0.00e+00, 9.54e-01, 0.00e+00, 0.00e+00, 2.31e-06, 1.43e-01, 0.00e+00, …
$ liveness <dbl> 0.1610, 0.1340, 0.1230, 0.2800, 0.1320, 0.1180, 0.0622, 0.0664, 0.3700, 0.7280, 0.3310, 0.0997, 0.1240, 0.4190, 0.0997, 0.0989, 0.3380, 0.2050, 0.1170…
$ valence <dbl> 0.336, 0.795, 0.911, 0.697, 0.214, 0.854, 0.979, 0.867, 0.965, 0.873, 0.848, 0.921, 0.622, 0.960, 0.324, 0.889, 0.859, 0.479, 0.964, 0.947, 0.303, 0.7…
$ tempo <dbl> 127.870, 154.999, 76.231, 72.615, 136.714, 117.287, 185.165, 142.779, 147.768, 206.313, 105.290, 177.162, 128.532, 87.055, 106.303, 128.929, 129.897, …
$ track_uri <chr> "006Ndmw2hHxvnLbJsBFnPx", "5ayybTSXNwcarDtxQKqvWX", "4jmFSkpcqLOUN6scGU6BOO", "3c7KT5CN8uYRaK3xThhdYt", "2urRqmAFhjZKo8Z6sEGzEv", "1tKMOJW1eTIMSNtILF0…
$ duration_ms <dbl> 216373, 153933, 128360, 162773, 165293, 161253, 150600, 138733, 131720, 153533, 144133, 152067, 126800, 133093, 173840, 157733, 137107, 134453, 146933…
$ time_signature <dbl> 4, 4, 4, 4, 3, 3, 4, 4, 4, 4, 3, 4, 4, 4, 4, 4, 4, 3, 4, 4, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
$ key_mode <chr> "A# major", "C major", "C major", "A major", "E major", "G major", "F# major", "B major", "C major", "G major", "C major", "A major", "E major", "G mi…
$ playlist_name <chr> "summer_hits_1958", "summer_hits_1958", "summer_hits_1958", "summer_hits_1958", "summer_hits_1958", "summer_hits_1958", "summer_hits_1958", "summer_hi…
$ playlist_img <chr> "https://mosaic.scdn.co/640/5e8c49f7a8d161c1d6510999bd867b6a91640dae6488d1b4d3b17500498b1e648e8a15a663ee1cc083350f7f3e709bffbd6bf47bea2d4132c145484af0…
$ track_name <chr> "Nel blu dipinto di blu", "Poor Little Fool", "Patricia", "Little Star", "My True Love", "Just A Dream", "When (Originally Performed By The Kalin Twin…
$ artist_name <chr> "Domenico Modugno", "Ricky Nelson", "Pérez Prado", "The Elegants", "Jack Scott", "Jimmy Clanton", "Paris Music", "The Everly Brothers", "Bobby Darin",…
$ album_name <chr> "Tutto Modugno (Mister Volare)", "Ricky Nelson (Expanded Edition / Remastered)", "El Rey Del Mambo", "Little Star: The Best Of The Elegants", "Best Of…
$ album_img <chr> "https://i.scdn.co/image/5e8c49f7a8d161c1d6510999bd867b6a91640dae", "https://i.scdn.co/image/f0f2c3321ca683bdc121ba039b98c13bbf37d6b2", "https://i.scd…
$ year <dbl> 1958, 1958, 1958, 1958, 1958, 1958, 1958, 1958, 1958, 1958, 1959, 1959, 1959, 1959, 1959, 1959, 1959, 1959, 1959, 1959, 1960, 1960, 1960, 1960, 1960, …
When we first looked at the data, we had a few initial
questions:
What is the average song duration per year, and how does that compare to the most popular songs of other years?
Which artists had multiple hits over the years?
Is there a direct correlation between the danceability of a song and its popularity?
Does tempo, and by extension mood, have an effect on how popular a song gets?
We keep these questions in mind as we explore the dataset.
summary(summer_bill_hits)
danceability        energy              key                 loudness            mode                speechiness         acousticness        instrumentalness    liveness
Min.   :0.2170      Min.   :0.0600      Length:600          Min.   :-23.574     Length:600          Min.   :0.02330     Min.   :0.0000488   Min.   :0.0000000   Min.   :0.02480
1st Qu.:0.5457      1st Qu.:0.4768      Class :character    1st Qu.:-10.947     Class :character    1st Qu.:0.03280     1st Qu.:0.0417250   1st Qu.:0.0000000   1st Qu.:0.08595
Median :0.6480      Median :0.6405      Mode  :character    Median : -8.072     Mode  :character    Median :0.04140     Median :0.1620000   Median :0.0000032   Median :0.12400
Mean   :0.6407      Mean   :0.6221                          Mean   : -8.587                         Mean   :0.06866     Mean   :0.2665156   Mean   :0.0364316   Mean   :0.17979
3rd Qu.:0.7402      3rd Qu.:0.7830                          3rd Qu.: -5.862                         3rd Qu.:0.06990     3rd Qu.:0.4472500   3rd Qu.:0.0007132   3rd Qu.:0.22275
Max.   :0.9800      Max.   :0.9890                          Max.   : -1.097                         Max.   :0.51700     Max.   :0.9870000   Max.   :0.9540000   Max.   :0.98900

valence             tempo               track_uri           duration_ms         time_signature      key_mode            playlist_name       playlist_img        track_name
Min.   :0.0695      Min.   : 62.83      Length:600          Min.   :103386      Min.   :3.000       Length:600          Length:600          Length:600          Length:600
1st Qu.:0.4790      1st Qu.:100.22      Class :character    1st Qu.:192887      1st Qu.:4.000       Class :character    Class :character    Class :character    Class :character
Median :0.6900      Median :120.01      Mode  :character    Median :226927      Median :4.000       Mode  :character    Mode  :character    Mode  :character    Mode  :character
Mean   :0.6488      Mean   :120.48                          Mean   :229434      Mean   :3.972
3rd Qu.:0.8482      3rd Qu.:133.84                          3rd Qu.:257854      3rd Qu.:4.000
Max.   :0.9860      Max.   :210.75                          Max.   :557293      Max.   :5.000

artist_name         album_name          album_img           year
Length:600          Length:600          Length:600          Min.   :1958
Class :character    Class :character    Class :character    1st Qu.:1973
Mode  :character    Mode  :character    Mode  :character    Median :1988
                                                            Mean   :1988
                                                            3rd Qu.:2002
                                                            Max.   :2017
print(summer_bill_hits)
Reordering the dataset so track_name and
artist_name come first:
Songs are identified by their names and artists. Putting these two variables at the front of at least one version of the data frame makes it easier to read, both for us when organizing the data and for outside readers looking at it.
summer_bill_hits %>%
select("track_name", "artist_name", everything())
What is the average song duration per year, and how does that compare to the most popular songs of other years?
Find the average song length for each year from 1958 to 2017:
summer_bill_hits %>%
group_by(playlist_name) %>%
summarize(avg_duration = mean(duration_ms, na.rm = TRUE)) %>%
mutate(avg_duration_mins = (avg_duration/1000)/60) %>%
select(-avg_duration)
# putting it in a dataframe
summer_hits_duration <- summer_bill_hits %>%
group_by(playlist_name) %>%
summarize(avg_duration = mean(duration_ms, na.rm = TRUE)) %>%
mutate(avg_duration_mins = (avg_duration/1000)/60) %>%
select(-avg_duration)
print(summer_hits_duration)
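As a sanity check on the millisecond-to-minute conversion above, a duration can also be shown as minutes and seconds. The helper below is a hypothetical addition, not part of the original analysis; the name `format_duration` is our own.

```r
# Hypothetical helper (not in the original analysis):
# format a duration in milliseconds as "m:ss".
format_duration <- function(ms) {
  total_seconds <- round(ms / 1000)                # whole seconds
  minutes <- as.integer(total_seconds %/% 60)      # full minutes
  seconds <- as.integer(total_seconds %% 60)       # leftover seconds
  sprintf("%d:%02d", minutes, seconds)
}

# Mean duration_ms reported by summary() above
format_duration(229434)  # "3:49"
```

Dividing by 1000 and then by 60, as the pipeline does, gives the same value in decimal minutes (about 3.82).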
summer_hits_duration %>%
filter(avg_duration_mins < 3)
The majority of the yearly averages are longer than 3 minutes. This is
interesting because the data show a gradual increase in song length
over much of the 60-year span.
Graph to display this:
with_playlistName_mod <- summer_hits_duration %>%
mutate(playlist_year = str_extract(playlist_name, "\\d+"))
print(with_playlistName_mod)
ggplot(data = with_playlistName_mod, aes(x = playlist_year, y = avg_duration_mins)) +
geom_point(color = "#1DB954") +
theme(axis.text.x = element_blank()) +
theme(plot.title = element_text(hjust = 0.5)) +
labs(title = "Average Duration per Year",
x = "Years from 1958 to 2017",
y = "Average Duration (mins)")
The average duration of popular songs throughout the years can tell us a lot about the tastes and preferences of listeners. The data points show a clear pattern: average duration rises and then falls. When looking at a dataset that depends on the preferences of a population, it is important to keep outside factors in mind. The pattern could be due to listening preferences at the time, world events during the period the data covers, or general behaviors of the populace (such as attention span).
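The duration plot above hides its x-axis labels because the year is extracted from playlist_name as text. A possible variation, sketched below, parses the year as an integer so the axis can show real years and a smoother can trace the trend. The function names `extract_playlist_year` and `plot_avg_duration` are hypothetical, and the loess smoother is our own choice rather than something the original analysis used.

```r
library(dplyr)
library(stringr)
library(ggplot2)

# Hypothetical helper: pull the first run of digits out of a playlist name
# such as "summer_hits_1958" and return it as an integer year.
extract_playlist_year <- function(playlist_name) {
  as.integer(str_extract(playlist_name, "\\d+"))
}

# Hypothetical helper: plot average duration against a numeric year axis.
# Assumes `durations` has the columns playlist_name and avg_duration_mins,
# as summer_hits_duration does.
plot_avg_duration <- function(durations) {
  durations %>%
    mutate(playlist_year = extract_playlist_year(playlist_name)) %>%
    ggplot(aes(x = playlist_year, y = avg_duration_mins)) +
    geom_point(color = "#1DB954") +
    geom_smooth(method = "loess", formula = y ~ x, se = FALSE, color = "grey40") +
    theme(plot.title = element_text(hjust = 0.5)) +
    labs(title = "Average Duration per Year",
         x = "Year",
         y = "Average Duration (mins)")
}

# Only runs when the data frame from the analysis is available
if (exists("summer_hits_duration")) print(plot_avg_duration(summer_hits_duration))
```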
Which artists had multiple hits over the years?
summer_bill_hits %>%
group_by(artist_name) %>%
count(sort = TRUE)
Rihanna had the most featured songs, with Elton John and Katy Perry close behind.
my_girl_riri <- summer_bill_hits %>%
filter(artist_name == "Rihanna") %>%
select(track_name, danceability, year)
my_girl_riri
Is there a direct correlation between the danceability of a song and its popularity?
Let’s first find the average danceability across all years:
summer_bill_hits %>%
summarize(ave_dance = mean(danceability))
The average danceability is about 0.64, and the values range from about 0.22 to 0.98. This is visualized in the graph below:
ggplot(data = summer_bill_hits,
aes(x = year, y = danceability)) +
geom_point() +
geom_smooth(color = "#1DB954")
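The summary above is a single mean over all 600 songs. Since the question is about danceability over the years, a per-year average may be a useful complement. This is a minimal sketch; `yearly_danceability` is a hypothetical name, and it assumes the data frame has the columns year and danceability.

```r
library(dplyr)

# Hypothetical helper: mean danceability for each year.
# Assumes `hits` has the columns year and danceability.
yearly_danceability <- function(hits) {
  hits %>%
    group_by(year) %>%
    summarize(avg_dance = mean(danceability, na.rm = TRUE), .groups = "drop") %>%
    arrange(year)
}

# Only runs when the dataset from the analysis is loaded
if (exists("summer_bill_hits")) print(yearly_danceability(summer_bill_hits))
```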
Let’s take a look at Rihanna’s danceability.
ggplot(my_girl_riri,
aes(x = year, y = danceability, color = track_name)) +
geom_point(size = 3, alpha = .7) +
theme(plot.title = element_text(hjust = 0.5)) +
labs(title = "Rihanna's Danceability",
x = "Years",
y = "Danceability")
Most of Rihanna’s songs are above the average danceability of 0.64; only two of the seven featured songs have a low danceability score. If you’re familiar with her song Pon de Replay, you might agree that it has that “summer” feeling to it.
Let’s check the other frequent artists to see if there is a pattern:
summer_bill_hits %>%
filter(artist_name %in% c("Rihanna", "Elton John", "Katy Perry", "Mariah Carey",
"The Rolling Stones", "Usher", "Donna Summer",
"The Beatles", "Wings")) %>%
ggplot(aes(x = year, y = danceability, color = artist_name)) +
geom_point(size = 8, alpha = .5) +
theme(plot.title = element_text(hjust = 0.5)) +
labs(title = "Top Artists' Danceability", x = "Years", y = "Danceability")
Danceability appears to be a useful factor in determining the popularity of a summer song. The more frequently featured artists tend to have high danceability scores across their songs. This could be a key factor in being chosen for each summer playlist.
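The plot shows the pattern visually; one way to check it numerically is to compare each frequent artist's mean danceability with the overall mean of about 0.64. The sketch below is an assumption-laden illustration: `danceability_by_frequent_artist` is a hypothetical name, and the cutoff of five songs for "frequent" is our own choice, not a threshold from the original analysis.

```r
library(dplyr)

# Hypothetical helper: mean danceability for artists with at least
# `min_songs` tracks in the data. The default of 5 is an assumed cutoff.
danceability_by_frequent_artist <- function(hits, min_songs = 5) {
  hits %>%
    group_by(artist_name) %>%
    summarize(n_songs = n(),
              avg_danceability = mean(danceability, na.rm = TRUE),
              .groups = "drop") %>%
    filter(n_songs >= min_songs) %>%
    arrange(desc(avg_danceability))
}

# Only runs when the dataset from the analysis is loaded
if (exists("summer_bill_hits")) print(danceability_by_frequent_artist(summer_bill_hits))
```

Artists whose avg_danceability is above the overall mean would support the observation; any below it would be exceptions worth noting.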
Does tempo, and by extension mood, have an effect on how popular a song gets?
summer_bill_hits %>%
select("track_name", "tempo", "energy") %>%
arrange(tempo)
summer_bill_hits %>%
summarize(avg_tempo = mean(tempo, na.rm = TRUE))
my_girl_riri_2 <- summer_bill_hits %>%
filter(artist_name == "Rihanna") %>%
select(track_name, energy, tempo)
my_girl_riri_2
Energy vs Tempo
ggplot(data = summer_bill_hits,
aes(x = tempo, y = energy)) +
geom_point(alpha = .5) +
geom_smooth(method = "lm", se = FALSE) +
theme(plot.title = element_text(hjust = 0.5)) +
labs(title = "Tempo and Energy",
x = "Tempo",
y = "Energy")
ggplot(my_girl_riri_2,
aes(x = energy, y = tempo, color = track_name)) +
geom_point(size = 3, alpha = .7) +
theme(plot.title = element_text(hjust = 0.5)) +
labs(title = "Rihanna's Energy",
x = "Energy",
y = "Tempo")
Tempo measures how fast or slow a song is. Our
assumption was that a faster song has higher energy. The
Tempo and Energy plot shows that this does not reliably hold. While the
trend is generally positive, there is not a very strong correlation
between the two variables. Many songs on the list have a slower
tempo but make up for it in energy, as shown in
Rihanna's Energy with the song Pon de Replay. While it is
not a fast-tempo song, it more than makes up for it in energy and danceability,
as shown in Rihanna's Danceability.
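The strength of the tempo and energy relationship can also be put into a number. A minimal sketch, assuming a Pearson correlation is an acceptable measure here; `tempo_energy_cor` is a hypothetical name and this step was not part of the original analysis.

```r
# Hypothetical helper: Pearson correlation between tempo and energy.
# Assumes `hits` has numeric columns tempo and energy; rows with
# missing values are dropped.
tempo_energy_cor <- function(hits) {
  cor(hits$tempo, hits$energy, use = "complete.obs", method = "pearson")
}

# Only runs when the dataset from the analysis is loaded
if (exists("summer_bill_hits")) print(tempo_energy_cor(summer_bill_hits))
```

A value close to 0 would indicate a weak linear relationship, while values near 1 or -1 would indicate a strong one.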
Overall, Spotify goes through a lot of data to create its curated playlists. Many songs appear to be under 3 minutes long in the early years, with durations peaking in the late 1900s before decreasing again in the 2000s. This could be related to the outside factors mentioned above, one of the most discussed being shorter attention spans in recent years. Danceability also appears to have a large impact on popularity: the most frequently featured artists have highly danceable songs. The more upbeat, short, and danceable a song is, the more likely it is to be featured on a summer Spotify playlist.
12/06/2023