This presentation is for the market research agency NEPA, where I will showcase an approach to this data. My aim is to communicate what this data can be used for and by whom, which tools I use to navigate it, and ideas on how to expand the product further.
One part of this data is collected from Spotify’s Top 200 charts and includes variables such as artist, track title, date, daily streams and daily ranking in the Top 200.
The other part is fetched through Spotify’s Web API, which gives access to further variables about the artist and track, such as tempo, key of the song, speechiness and instrumentalness.
First we’ll load the libraries needed for this project.
library(tidyverse) ## data viz, data wrangling, etc.
library(tidymodels) ## data modeling
library(zoo) ## rolling averages
library(spotifyr) ## Spotify web API R wrapper
library(kableExtra)
Load plot functions, ggplot2 theming and other libraries from script.
source("appPlotfunctions.R")
This data has already been tidied and wrangled by two other scripts. One is “API-import-wrangle.Rmd”, where I used Spotify’s Web API to connect track and artist data to a dataset of Spotify’s Top 200 that was scraped from spotifycharts.com, covering 2017-2021 with daily streams and rankings.
Together they create the dataset that I saved specifically for my thesis project, and it will be used for this notebook presentation as well.
(This has already been csv-imported through “appPlotfunctions.R”)
As we can see below, this dataset contains 42 columns, so although it is technically tidy, it is too wide to be usable at first glance.
head(globalcharts)
## # A tibble: 6 × 42
## title artist streams rank date year quarter month trend min_streams
## <chr> <chr> <dbl> <dbl> <date> <dbl> <chr> <chr> <chr> <dbl>
## 1 STAY (… The Ki… 7714466 1 2021-09-30 2021 2021/Q3 2021… SAME… 5240809
## 2 INDUST… Lil Na… 6517968 2 2021-09-30 2021 2021/Q3 2021… SAME… 4331516
## 3 Heat W… Glass … 4460880 3 2021-09-30 2021 2021/Q3 2021… SAME… 730942
## 4 My Uni… Coldpl… 4142687 4 2021-09-30 2021 2021/Q3 2021… MOVE… 4013065
## 5 Bad Ha… Ed She… 4077321 5 2021-09-30 2021 2021/Q3 2021… MOVE… 3777161
## 6 Pepas Farruko 3982650 6 2021-09-30 2021 2021/Q3 2021… SAME… 693354
## # … with 32 more variables: max_streams <dbl>, avg_streams <dbl>,
## # median_streams <dbl>, total_streams <dbl>, min_rank <dbl>, max_rank <dbl>,
## # avg_rank <dbl>, median_rank <dbl>, days <int>, artist_days <int>,
## # streamstitle_3dMA <dbl>, streamstitle_7dMA <dbl>, streamstitle_14dMA <dbl>,
## # streamstitle_30dMA <dbl>, streams_7dMA <dbl>, streams_14dMA <dbl>,
## # streams_30dMA <dbl>, streams_90dMA <dbl>, danceability <dbl>, energy <dbl>,
## # key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>, …
First we would like to focus on the variables related to streaming: ‘title’, ‘artist’, ‘streams’, ‘rank’, ‘trend’, the rolling averages of streams (by rank), and any metrics related to streams or time (dates).
Let’s go ahead and drop the rest of the columns and name our new dataframe “global_streaming”.
global_streaming <- globalcharts %>%
  select(title:artist_days, streams_7dMA, streams_14dMA) %>%
  rename(streamsrank_7dMA = streams_7dMA,
         streamsrank_14dMA = streams_14dMA)
global_streaming %>%
head(10) %>%
kbl() %>%
kable_material_dark()
| title | artist | streams | rank | date | year | quarter | month | trend | min_streams | max_streams | avg_streams | median_streams | total_streams | min_rank | max_rank | avg_rank | median_rank | days | artist_days | streamsrank_7dMA | streamsrank_14dMA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| STAY (with Justin Bieber) | The Kid LAROI | 7714466 | 1 | 2021-09-30 | 2021 | 2021/Q3 | 2021/9 | SAME_POSITION | 5240809 | 10629302 | 8388760 | 8716512 | 704655831 | 1 | 4 | 1 | 1 | 84 | 741 | NA | NA |
| INDUSTRY BABY (feat. Jack Harlow) | Lil Nas X | 6517968 | 2 | 2021-09-30 | 2021 | 2021/Q3 | 2021/9 | SAME_POSITION | 4331516 | 8808346 | 6089915 | 6113264 | 426294063 | 2 | 13 | 3 | 2 | 70 | 1221 | NA | NA |
| Heat Waves | Glass Animals | 4460880 | 3 | 2021-09-30 | 2021 | 2021/Q3 | 2021/9 | SAME_POSITION | 730942 | 4784611 | 1938349 | 1774850 | 552429333 | 3 | 200 | 43 | 34 | 285 | 287 | NA | NA |
| My Universe | Coldplay, BTS | 4142687 | 4 | 2021-09-30 | 2021 | 2021/Q3 | 2021/9 | MOVE_UP | 4013065 | 6768788 | 4684302 | 4219891 | 32790116 | 3 | 5 | 4 | 3 | 7 | 8 | NA | NA |
| Bad Habits | Ed Sheeran | 4077321 | 5 | 2021-09-30 | 2021 | 2021/Q3 | 2021/9 | MOVE_DOWN | 3777161 | 6008275 | 5006624 | 5117893 | 490649138 | 3 | 19 | 4 | 4 | 98 | 9860 | NA | NA |
| Pepas | Farruko | 3982650 | 6 | 2021-09-30 | 2021 | 2021/Q3 | 2021/9 | SAME_POSITION | 693354 | 4991996 | 3413331 | 3921750 | 276479845 | 3 | 189 | 20 | 8 | 81 | 343 | NA | NA |
| Woman | Doja Cat | 3905977 | 7 | 2021-09-30 | 2021 | 2021/Q3 | 2021/9 | SAME_POSITION | 405110 | 4541274 | 2571863 | 3017278 | 234039535 | 5 | 200 | 53 | 19 | 91 | 1675 | NA | NA |
| Shivers | Ed Sheeran | 3706870 | 8 | 2021-09-30 | 2021 | 2021/Q3 | 2021/9 | SAME_POSITION | 2755593 | 4465660 | 3661475 | 3697536 | 76890977 | 8 | 13 | 9 | 9 | 21 | 9860 | NA | NA |
| THATS WHAT I WANT | Lil Nas X | 3529010 | 9 | 2021-09-30 | 2021 | 2021/Q3 | 2021/9 | SAME_POSITION | 3333903 | 5208116 | 4170450 | 4247074 | 58386295 | 3 | 9 | 6 | 6 | 14 | 1221 | NA | NA |
| Beggin’ | Måneskin | 3367619 | 10 | 2021-09-30 | 2021 | 2021/Q3 | 2021/9 | SAME_POSITION | 721300 | 8005228 | 4916102 | 4978832 | 599764421 | 1 | 195 | 14 | 4 | 122 | 377 | NA | NA |
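The rolling-average columns such as streams_7dMA were computed upstream in the wrangling scripts, which are not shown here. As a rough sketch of how a trailing 7-day mean per chart position could be produced with zoo, assuming “by rank” means the window runs over the daily streams at a fixed rank (the toy values and the name rank1_toy are illustrative, not taken from the dataset):

```r
library(dplyr)
library(zoo)

# Toy series: streams at chart position 1 over ten days (illustrative values)
rank1_toy <- tibble(
  date    = as.Date("2021-09-01") + 0:9,
  rank    = 1,
  streams = c(10, 12, 11, 13, 15, 14, 16, 18, 17, 19)
)

rank1_toy <- rank1_toy %>%
  group_by(rank) %>%
  arrange(date, .by_group = TRUE) %>%
  # trailing 7-day mean; the first six days have no full window, hence NA
  mutate(streams_7dMA = rollmean(streams, k = 7, fill = NA, align = "right")) %>%
  ungroup()
```

The upstream scripts may have used a different window alignment or ordering, so this only illustrates the general technique.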
Second, we also want to handle the “NA” values in the audio feature columns, such as danceability and energy. Audio features appear only once for each title; keeping them once, instead of repeating the same numbers on every row, saves memory. We can extract the audio features for each track into their own dataframe and name it “global_audiofeats”.
global_audiofeats <- globalcharts %>%
select(title, artist, energy:time_signature, total_streams, days, artist_days) %>%
drop_na()
global_audiofeats %>%
head(10) %>%
kbl() %>%
kable_material_dark()
| title | artist | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration | time_signature | total_streams | days | artist_days |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Last One Standing (feat. Polo G, Mozzy & Eminem) | Skylar Grey | 0.692 | 1 | -4.128 | 0 | 0.1270 | 0.23700 | 0.00e+00 | 0.0798 | 0.233 | 154.924 | 257.369s (~4.29 minutes) | 4 | 1199762 | 1 | 1 |
| fue mejor (feat. SZA) | Kali Uchis | 0.384 | 5 | -7.414 | 0 | 0.0353 | 0.23900 | 2.27e-04 | 0.0762 | 0.230 | 106.002 | 230.965s (~3.85 minutes) | 4 | 736722 | 1 | 228 |
| ONLY | LeeHi | 0.296 | 5 | -7.451 | 1 | 0.0346 | 0.89200 | 0.00e+00 | 0.0873 | 0.151 | 122.907 | 240.907s (~4.02 minutes) | 3 | 2212553 | 3 | 3 |
| Ya Supérame (En Vivo) | Grupo Firme | 0.400 | 0 | -6.521 | 1 | 0.1100 | 0.46300 | 0.00e+00 | 0.7600 | 0.568 | 75.576 | 189.05s (~3.15 minutes) | 3 | 677450 | 1 | 1 |
| LOCO | ITZY | 0.886 | 1 | -3.067 | 1 | 0.1770 | 0.01090 | 2.21e-05 | 0.3250 | 0.497 | 102.012 | 191.462s (~3.19 minutes) | 4 | 6118531 | 6 | 97 |
| A Maior Saudade - Ao Vivo | Henrique & Juliano | 0.626 | 11 | -4.780 | 1 | 0.0533 | 0.57100 | 0.00e+00 | 0.8700 | 0.305 | 162.824 | 190.162s (~3.17 minutes) | 4 | 802947 | 1 | 134 |
| La Sinvergüenza (feat. Banda MS de Sergio Lizárraga) | Christian Nodal | 0.482 | 9 | -5.352 | 1 | 0.0349 | 0.21100 | 0.00e+00 | 0.3000 | 0.604 | 102.010 | 198.529s (~3.31 minutes) | 4 | 786876 | 1 | 21 |
| Expectativa x Realidade | Matheus & Kauan | 0.839 | 6 | -5.674 | 1 | 0.0758 | 0.18200 | 0.00e+00 | 0.2000 | 0.568 | 123.793 | 163.324s (~2.72 minutes) | 4 | 765597 | 1 | 1 |
| My Universe | Coldplay, BTS | 0.701 | 9 | -6.390 | 1 | 0.0402 | 0.00813 | 0.00e+00 | 0.2000 | 0.443 | 104.988 | 228s (~3.8 minutes) | 4 | 32790116 | 7 | 8 |
| Your Heart | Joyner Lucas, J. Cole | 0.637 | 6 | -7.701 | 0 | 0.1330 | 0.39200 | 0.00e+00 | 0.0929 | 0.754 | 87.006 | 198.621s (~3.31 minutes) | 4 | 7564368 | 7 | 7 |
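If the audio features are needed next to the daily streams again, the two tables can be joined on title and artist. A minimal sketch on toy data, assuming the title and artist pair identifies a track uniquely (streaming_toy, audiofeats_toy and all values are illustrative):

```r
library(dplyr)

# Toy stand-ins for the two tables (values are illustrative, not from the dataset)
streaming_toy <- tibble(
  title   = c("Song A", "Song A", "Song B"),
  artist  = c("Artist X", "Artist X", "Artist Y"),
  date    = as.Date(c("2021-09-29", "2021-09-30", "2021-09-30")),
  streams = c(1200000, 1100000, 900000)
)

audiofeats_toy <- tibble(
  title  = c("Song A", "Song B"),
  artist = c("Artist X", "Artist Y"),
  energy = c(0.70, 0.45),
  tempo  = c(120, 98)
)

# One row per track in the lookup table, so the join cannot duplicate rows
stopifnot(!anyDuplicated(audiofeats_toy[c("title", "artist")]))

joined_toy <- streaming_toy %>%
  left_join(audiofeats_toy, by = c("title", "artist"))
```

Keeping the features in a separate lookup table and joining only when needed keeps the daily table narrow.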
Let’s get back to the streaming dataframe, “global_streaming”.
We first want to examine the distributions of the numeric variables in this dataset.
I like to write my own functions for visuals that I repeat.
histogram_plot <- function(df, plot_title, n_bins) {
  df %>%
    keep(is.numeric) %>%
    gather() %>% # creates two columns, key and value
    ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free", ncol = 3) +
    geom_histogram(bins = n_bins) +
    scale_x_continuous(labels = scales::comma) +
    scale_y_continuous(labels = scales::comma) +
    labs(title = plot_title) +
    chewyTheme()
}
Plotting all numeric variables related to streaming.
histogram_plot(global_streaming %>% select(-year, -rank), "Distributions for Global Top 200", 200)
Plotting the most important variable, ‘streams’, which is most likely to be used as the output (Y) variable.
global_streaming %>%
select(streams) %>%
keep(is.numeric) %>%
gather() %>% #creates two columns, key and value
ggplot(aes(value)) +
facet_wrap( ~ key, scales = "free", ncol = 3) +
geom_histogram(bins = 1000) +
scale_x_continuous(labels = scales::comma, breaks = seq(0, 20e6, by = 4e6)) +
scale_y_continuous(labels = scales::comma) +
labs(title = "Distribution of streams in the Global Top 200 from 2017 to Sep' 2021") +
chewyTheme()
The graph above is an important one. The histogram splits the range of daily streams into 1,000 bins, so each bar covers a narrow range of stream counts. We can see that the large majority of chart entries don’t reach 4 million streams in a day.
What we can infer, just by looking at this distribution, is that it is heavily right-skewed and resembles a Pareto distribution.
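One quick way to put a number on this kind of concentration is to ask what share of all streams comes from the top 20% of rows. The helper below is a hypothetical sketch (top_share is not part of the project's scripts), shown on simulated heavy-tailed values rather than on the real data:

```r
# Share of the total accounted for by the top fraction `p` of values.
top_share <- function(x, p = 0.2) {
  x <- sort(x[!is.na(x)], decreasing = TRUE)
  n_top <- max(1, ceiling(p * length(x)))
  sum(x[seq_len(n_top)]) / sum(x)
}

# Simulated heavy-tailed values (inverse-transform sampling from a Pareto
# distribution); the scale and shape here are illustrative, not estimated.
set.seed(1)
simulated <- 300000 * runif(10000)^(-1 / 1.5)

top_share(simulated)     # heavy tail: well above 0.2
top_share(rep(1, 100))   # identical values: the top 20% hold 20% of the total
```

On the real data this could be called as top_share(global_streaming$streams). A share far above 0.2 would be consistent with a Pareto-like shape, although it does not prove one.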
Before jumping into any kind of modeling, there is a lot of data exploration left to do. For me as an analyst, it is important to fully grasp what data is being used and what it may or may not be used for (both can be equally important to consider).
For example, we know from the start that this dataset represents Spotify’s Top 200 charts, which means that even the lowest stream counts in our dataset (300k streams) are extremely high compared to all the tracks streamed on Spotify on the same day. We are therefore dealing with the extreme top end of a much larger population of tracks.
Let’s check how streams progress over time, and also how the ranks are distributed as a third variable.
streams_timelineplot(global_streaming,
plot_title = "Daily streams from 2017 - Sep' 2021",
legendshow = FALSE)
One thing we can see is the clear month-to-month changes, along with more frequent fluctuations. To inspect this further, we could analyze the daily rate of change in streams over the whole dataset. Afterwards we can choose specific cases.
The reasoning behind this is to view the Top 200 as its own “market”, analogous to the stock market, where streams change daily just as asset prices do. The “asset” would then be the track. We can also safely assume that artists recur on the Top 200.
Putting it all together, we can see the track as an underlying asset of the artist, and thus there would be two “prices”: one that changes daily for the track asset, and one that changes daily for the artist asset.
Let’s call the entire dataframe for the Top 200 “market”: global_market
global_market %>%
head(10) %>%
kbl() %>%
kable_material_dark()
| date | market_streams | market_streams_text | perc_change | perc_text |
|---|---|---|---|---|
| 2017-01-01 | 148613167 | 148.6m | NA | NA |
| 2017-01-02 | 154810836 | 154.8m | 0.0417034 | 4.17% |
| 2017-01-03 | 166239930 | 166.2m | 0.0738262 | 7.38% |
| 2017-01-04 | 169252507 | 169.3m | 0.0181219 | 1.81% |
| 2017-01-05 | 169919094 | 169.9m | 0.0039384 | 0.39% |
| 2017-01-06 | 182321587 | 182.3m | 0.0729906 | 7.30% |
| 2017-01-07 | 177567024 | 177.6m | -0.0260779 | -2.61% |
| 2017-01-08 | 162831491 | 162.8m | -0.0829858 | -8.30% |
| 2017-01-09 | 175235859 | 175.2m | 0.0761792 | 7.62% |
| 2017-01-10 | 178480856 | 178.5m | 0.0185179 | 1.85% |
daily_change_linePlot(global_market, plot_title = "Daily change (%) in streams")
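The construction of global_market is not shown in this notebook. A minimal sketch of how such a table could be derived, assuming it sums the daily streams of all charted tracks and then takes the change from the previous row (charts_toy, market_toy and the values are illustrative):

```r
library(dplyr)

# Toy chart rows: two tracks per day (values are illustrative)
charts_toy <- tibble(
  date    = as.Date(c("2017-01-01", "2017-01-01", "2017-01-02", "2017-01-02")),
  title   = c("Song A", "Song B", "Song A", "Song B"),
  streams = c(60, 40, 70, 40)
)

market_toy <- charts_toy %>%
  group_by(date) %>%
  summarise(market_streams = sum(streams), .groups = "drop") %>%
  arrange(date) %>%
  # change relative to the previous row, i.e. the previous charted day
  mutate(perc_change = market_streams / lag(market_streams) - 1)
```

The first day has no previous value, which is why the first perc_change is NA, as in the table above.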
Let’s also build the dataframes for titles and artists.
artists_market <- df %>%
  select(artist, title, streams, date) %>%
  group_by(artist, date) %>%
  summarise(market_streams = sum(streams)) %>%
  mutate(market_streams_text = streamLabels(market_streams)) %>%
  ## daily change in streams for each artist
  mutate(perc_change = market_streams/lag(market_streams) - 1,
         perc_text = percent(perc_change, accuracy = 0.01)) %>%
  ungroup()
## `summarise()` has grouped output by 'artist'. You can override using the `.groups` argument.
title_market <- df %>%
  select(title, streams, date) %>%
  group_by(title, date) %>%
  summarise(market_streams = sum(streams)) %>%
  mutate(market_streams_text = streamLabels(market_streams)) %>%
  ## daily change in streams for each title
  mutate(perc_change = market_streams/lag(market_streams) - 1,
         perc_text = percent(perc_change, accuracy = 0.01)) %>%
  ungroup()
## `summarise()` has grouped output by 'title'. You can override using the `.groups` argument.
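One caveat with lag() here: it compares each row with the previous row in the group, not with the previous calendar day. If an artist or title drops off the Top 200 for a few days, the “daily” change spans the whole gap. A hedged sketch of one way to flag this, on toy data (artist_toy, days_since_prev and perc_change_1d are illustrative names):

```r
library(dplyr)

# Toy artist series with a gap: no chart entry between Jan 2 and Jan 5
artist_toy <- tibble(
  artist         = "Artist X",
  date           = as.Date(c("2017-01-01", "2017-01-02", "2017-01-05")),
  market_streams = c(100, 120, 90)
)

artist_toy_checked <- artist_toy %>%
  group_by(artist) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(
    days_since_prev = as.numeric(date - lag(date)),
    perc_change     = market_streams / lag(market_streams) - 1,
    # keep the change only when the previous row is the previous calendar day
    perc_change_1d  = if_else(days_since_prev == 1, perc_change, NA_real_)
  ) %>%
  ungroup()
```

Whether to drop, keep or rescale changes across gaps is a design choice that depends on the analysis.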
artists_market %>%
sample_n(10) %>%
kbl() %>%
kable_material_dark()
| artist | date | market_streams | market_streams_text | perc_change | perc_text |
|---|---|---|---|---|---|
| Harry Styles | 2021-07-28 | 2443148 | 2.4m | 0.0144396 | 1.44% |
| Piso 21, Christian Nodal | 2020-01-06 | 673783 | 0.7m | 0.0688538 | 6.89% |
| Hailee Steinfeld, BloodPop® | 2018-03-29 | 1133601 | 1.1m | -0.0389856 | -3.90% |
| Megan Thee Stallion | 2021-08-16 | 1065225 | 1.1m | 0.0399633 | 4.00% |
| Marshmello, Khalid | 2018-02-10 | 1501321 | 1.5m | -0.0789651 | -7.90% |
| DJ Snake, J Balvin | 2019-11-03 | 1212624 | 1.2m | -0.1554649 | -15.55% |
| Rudimental, Major Lazer | 2018-07-23 | 584506 | 0.6m | 0.1727884 | 17.28% |
| Charlie Puth | 2017-08-19 | 3249172 | 3.2m | -0.0456526 | -4.57% |
| Jonas Blue, William Singe | 2017-06-08 | 2205729 | 2.2m | 0.0036849 | 0.37% |
| Pop Smoke | 2020-11-12 | 10187585 | 10.2m | -0.0006810 | -0.07% |
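Because sample_n() draws rows at random, this table differs on every run. In the sample shown above we can see, for example, that the artist entry “Marshmello, Khalid” was down 7.90% in streams from the day before.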
From here on out we can choose any specific artist/track for further analysis.
Let’s say we would like to give Drake’s marketing team something to work with.
First, we can give them his “portfolio” in the Top 200.
drake_market <- artists_market %>%
filter(artist == "Drake")
drake_market %>%
head(10) %>%
kbl() %>%
kable_material_dark()
| artist | date | market_streams | market_streams_text | perc_change | perc_text |
|---|---|---|---|---|---|
| Drake | 2017-01-01 | 3284077 | 3.3m | NA | NA |
| Drake | 2017-01-02 | 3329204 | 3.3m | 0.0137412 | 1.37% |
| Drake | 2017-01-03 | 3536864 | 3.5m | 0.0623753 | 6.24% |
| Drake | 2017-01-04 | 3568626 | 3.6m | 0.0089803 | 0.90% |
| Drake | 2017-01-05 | 3596688 | 3.6m | 0.0078635 | 0.79% |
| Drake | 2017-01-06 | 3657554 | 3.7m | 0.0169228 | 1.69% |
| Drake | 2017-01-07 | 3558220 | 3.6m | -0.0271586 | -2.72% |
| Drake | 2017-01-08 | 3324323 | 3.3m | -0.0657343 | -6.57% |
| Drake | 2017-01-09 | 3462467 | 3.5m | 0.0415555 | 4.16% |
| Drake | 2017-01-10 | 3517026 | 3.5m | 0.0157573 | 1.58% |
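Over this 10-day period, Drake had two days where he lost streams, and they were consecutive (2017-01-07 and 2017-01-08). Insight: another artist comparable to Drake may have outcompeted him on those two days by dropping a single or an album. Note, however, that the market as a whole also fell on the same two days (-2.61% and -8.30%), so the dip may instead reflect a chart-wide pattern.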
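To separate an artist's own movement from the movement of the whole chart, one option borrowed from the market analogy is to subtract the market's daily change from the artist's. A small sketch using the values from the global_market and drake_market tables above (the column name excess_perc is illustrative):

```r
library(dplyr)

# Daily changes copied from the two tables above (2017-01-06 to 2017-01-08)
market_chg <- tibble(
  date        = as.Date(c("2017-01-06", "2017-01-07", "2017-01-08")),
  market_perc = c(0.0729906, -0.0260779, -0.0829858)
)
drake_chg <- tibble(
  date        = as.Date(c("2017-01-06", "2017-01-07", "2017-01-08")),
  artist_perc = c(0.0169228, -0.0271586, -0.0657343)
)

relative_chg <- drake_chg %>%
  inner_join(market_chg, by = "date") %>%
  # positive: the artist did better than the chart as a whole that day
  mutate(excess_perc = artist_perc - market_perc)
```

On 2017-01-08, for instance, Drake's -6.57% was a smaller drop than the market's -8.30%, so his excess change that day is positive.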
Showing them Drake’s daily change over time reveals immense virality in his latest release.
daily_change_linePlot(drake_market,
plot_title = "Drake's daily change (%) in streams") +
daily_change_linePlot(drake_market %>% filter(perc_change < 1),
plot_title = "Drake's daily change (%) in streams < 100% daily change")
Say we want to show them the differences between four of his biggest hit songs. A function I’ve written for this dataset extracts Drake’s top-streamed tracks over the period covered by the dataset.
topkTitle_totalstreams(global_streaming,
sel_artist = "Drake",
k = 4)
## # A tibble: 4 × 13
## title total_streams min_streams max_streams avg_streams min_rank max_rank
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 God's Plan 1419636888 561125 8553009 1979968 1 200
## 2 In My Fee… 844174865 406279 9847333 2300204 1 200
## 3 Toosie Sl… 676623030 670495 6574812 2075531 1 200
## 4 Nice For … 589260626 579247 6443974 1977385 1 200
## # … with 6 more variables: avg_rank <dbl>, streamingrank <int>, year <dbl>,
## # perc <dbl>, perc_text <chr>, streams_text <chr>
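The helper topkTitle_totalstreams() is defined in appPlotfunctions.R and is not shown here. As a rough idea of what such a helper does, here is a simplified, hypothetical stand-in (top_k_titles and the toy data are illustrative; the real function returns more columns):

```r
library(dplyr)

# Hypothetical stand-in for topkTitle_totalstreams(); the real helper lives in
# appPlotfunctions.R and returns more columns.
top_k_titles <- function(data, sel_artist, k = 4) {
  data %>%
    filter(artist == sel_artist) %>%
    group_by(title) %>%
    summarise(
      total_streams = sum(streams),
      max_streams   = max(streams),
      min_rank      = min(rank),
      .groups       = "drop"
    ) %>%
    slice_max(total_streams, n = k, with_ties = FALSE)
}

toy <- tibble(
  artist  = c("A", "A", "A", "B"),
  title   = c("t1", "t1", "t2", "t3"),
  streams = c(10, 20, 50, 99),
  rank    = c(5, 3, 1, 2)
)
top_k_titles(toy, sel_artist = "A", k = 1)
```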
Let’s pick these four to compare their rates over time.
daily_change_linePlot(title_market %>% filter(title == "God's Plan"), "God's Plan") +
daily_change_linePlot(title_market %>% filter(title == "In My Feelings"), "In My Feelings") +
daily_change_linePlot(title_market %>% filter(title == "Toosie Slide"), "Toosie Slide") +
daily_change_linePlot(title_market %>% filter(title == "Nice For What"), "Nice For What")
We can also visualize his streams over time, show them his best days, and show how he trends at specific ranks. In the plot below I have set a rank threshold of 3, and the tracks that meet it are shown in color.
titlesartist_lineplot(df,
sel_artist = "Drake",
n_ranks = 3,
legendshow = TRUE)
We can compare Drake with other artists who also dominate the charts.
titlesartist_lineplot(df,
sel_artist = "Drake",
n_ranks = 1,
legendshow = TRUE) /
titlesartist_lineplot(df,
sel_artist = "Ed Sheeran",
n_ranks = 1,
legendshow = FALSE) /
titlesartist_lineplot(df,
sel_artist = "The Weeknd",
n_ranks = 1,
legendshow = FALSE)
We can also show them his streams over time categorized by months, quarters or years.
topyearsbarplot(topkYear_totalstreams(global_streaming,
sel_artist = "Drake"),
sel_artist = "Drake")
topquartersbarplot(topkQuarter_totalstreams(global_streaming,
sel_artist = "Drake"),
sel_artist = "Drake")
topmonthsbarplot(topkMonth_totalstreams(global_streaming,
sel_artist = "Drake"),
sel_artist = "Drake")
This all assumes that Drake’s team wants to increase his presence on the charts.
The modeling part is not expanded upon in this project, as it would require diving deeper into statistics.
The main choice in modeling is between two routes: financial modeling, which views the artist and titles as assets in a portfolio, or models such as kNN, OLS or PLS.
In reality, the choice depends on what data is available and on data engineering limits.
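As a starting point for the OLS route, a minimal sketch on simulated data. The log-log relationship between streams and rank, and every number below, are assumptions made for illustration; they are not results from the dataset.

```r
# Simulated chart day: streams fall off with rank (illustrative only)
set.seed(42)
sim <- data.frame(rank = 1:200)
sim$streams <- 8e6 * sim$rank^(-0.6) * exp(rnorm(200, sd = 0.1))

# OLS on the log-log scale
fit <- lm(log(streams) ~ log(rank), data = sim)
coef(fit)
```

On real data, the same call could be pointed at a single day of global_streaming to see how steeply streams decline down the chart.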
This product can be used by artists, managers and record labels to gain insights about how their music is performing on the charts.
Further expansion of the product includes sourcing data from other relevant Spotify playlists, such as Viral 50 or “RapCaviar”, as well as including Apple Music and Tidal, and possibly connecting to various metrics we can get from Instagram and TikTok.
This project is not for commercial use, as it is intended for presentation purposes only. Any IP created within this notebook belongs to Kareem Elgindy under legal rights to Rising Sun AB.