Introduction

This presentation will be for market research agency NEPA, where I will show case an approach to this data. My aim is to communicate what usage this data might have and for who, what tools I will use to navigate myself in this territory and to bring ideas on how to further expand upon the product.

One part of this data is collected from Spotify’s Top 200 charts and include variables such as artist, track title, date, daily streams and daily ranking in the Top 200.

The other part is fetched through Spotify’s Web API, where access to further variables about the artist and track are available, such as tempo, key of the song, speechiness, instrumentalness, etc.

Setup

First we’ll load the libraries needed for this project.

library(tidyverse) ## data viz, data wrangling, etc.
library(tidymodels) ## data modeling
library(zoo) ## rolling averages
library(spotifyr) ## Spotify web API R wrapper
library(kableExtra)

Load plot functions, ggplot2 theming and other libraries from script.

source("appPlotfunctions.R")

Exploratory data analysis (EDA)

Importing data

This data is already tidied data and wrangled from two other scripts. One is the “API-import-wrangle.Rmd”, where I’ve utilised Spotify’s Web API to connect a dataset of Spotify’s Top 200 that was scraped from spotifycharts.com, ranging from 2017-2021 with daily streams and rankings.

Together they create this dataset that I have saved specifically for my thesis project and, will be used for this notebook presentation aswell.

(This has already been csv-imported through “appPlotfunctions.R”)

Inspecting & cleaning the data

As we can see below, this dataset contains 42 columns, so although it is technically tidy, it is not yet useable for any purpose at first glance.

head(globalcharts)

## # A tibble: 6 × 42
##   title   artist  streams  rank date        year quarter month trend min_streams
##   <chr>   <chr>     <dbl> <dbl> <date>     <dbl> <chr>   <chr> <chr>       <dbl>
## 1 STAY (… The Ki… 7714466     1 2021-09-30  2021 2021/Q3 2021… SAME…     5240809
## 2 INDUST… Lil Na… 6517968     2 2021-09-30  2021 2021/Q3 2021… SAME…     4331516
## 3 Heat W… Glass … 4460880     3 2021-09-30  2021 2021/Q3 2021… SAME…      730942
## 4 My Uni… Coldpl… 4142687     4 2021-09-30  2021 2021/Q3 2021… MOVE…     4013065
## 5 Bad Ha… Ed She… 4077321     5 2021-09-30  2021 2021/Q3 2021… MOVE…     3777161
## 6 Pepas   Farruko 3982650     6 2021-09-30  2021 2021/Q3 2021… SAME…      693354
## # … with 32 more variables: max_streams <dbl>, avg_streams <dbl>,
## #   median_streams <dbl>, total_streams <dbl>, min_rank <dbl>, max_rank <dbl>,
## #   avg_rank <dbl>, median_rank <dbl>, days <int>, artist_days <int>,
## #   streamstitle_3dMA <dbl>, streamstitle_7dMA <dbl>, streamstitle_14dMA <dbl>,
## #   streamstitle_30dMA <dbl>, streams_7dMA <dbl>, streams_14dMA <dbl>,
## #   streams_30dMA <dbl>, streams_90dMA <dbl>, danceability <dbl>, energy <dbl>,
## #   key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>, …

First we would like to focus on the variables related to streaming, ‘title’, ‘artist’, ‘streams’, ‘rank’, ‘trend’, streams rolling averages (by rank) and any metrics related to streams or time (dates).

Let’s go ahead and drop the rest of the columns and name our new dataframe “global_streaming”.

global_streaming <- globalcharts %>%
  select(title:artist_days, streams_7dMA, streams_14dMA) %>%
  rename(streamsrank_7dMA = streams_7dMA,
         streamsrank_14dMA = streams_14dMA)

global_streaming %>% 
  head(10) %>% 
  kbl() %>%
  kable_material_dark()

title	artist	streams	rank	date	year	quarter	month	trend	min_streams	max_streams	avg_streams	median_streams	total_streams	min_rank	max_rank	avg_rank	median_rank	days	artist_days	streamsrank_7dMA	streamsrank_14dMA
STAY (with Justin Bieber)	The Kid LAROI	7714466	1	2021-09-30	2021	2021/Q3	2021/9	SAME_POSITION	5240809	10629302	8388760	8716512	704655831	1	4	1	1	84	741	NA	NA
INDUSTRY BABY (feat. Jack Harlow)	Lil Nas X	6517968	2	2021-09-30	2021	2021/Q3	2021/9	SAME_POSITION	4331516	8808346	6089915	6113264	426294063	2	13	3	2	70	1221	NA	NA
Heat Waves	Glass Animals	4460880	3	2021-09-30	2021	2021/Q3	2021/9	SAME_POSITION	730942	4784611	1938349	1774850	552429333	3	200	43	34	285	287	NA	NA
My Universe	Coldplay, BTS	4142687	4	2021-09-30	2021	2021/Q3	2021/9	MOVE_UP	4013065	6768788	4684302	4219891	32790116	3	5	4	3	7	8	NA	NA
Bad Habits	Ed Sheeran	4077321	5	2021-09-30	2021	2021/Q3	2021/9	MOVE_DOWN	3777161	6008275	5006624	5117893	490649138	3	19	4	4	98	9860	NA	NA
Pepas	Farruko	3982650	6	2021-09-30	2021	2021/Q3	2021/9	SAME_POSITION	693354	4991996	3413331	3921750	276479845	3	189	20	8	81	343	NA	NA
Woman	Doja Cat	3905977	7	2021-09-30	2021	2021/Q3	2021/9	SAME_POSITION	405110	4541274	2571863	3017278	234039535	5	200	53	19	91	1675	NA	NA
Shivers	Ed Sheeran	3706870	8	2021-09-30	2021	2021/Q3	2021/9	SAME_POSITION	2755593	4465660	3661475	3697536	76890977	8	13	9	9	21	9860	NA	NA
THATS WHAT I WANT	Lil Nas X	3529010	9	2021-09-30	2021	2021/Q3	2021/9	SAME_POSITION	3333903	5208116	4170450	4247074	58386295	3	9	6	6	14	1221	NA	NA
Beggin’	Måneskin	3367619	10	2021-09-30	2021	2021/Q3	2021/9	SAME_POSITION	721300	8005228	4916102	4978832	599764421	1	195	14	4	122	377	NA	NA

Second, we also want the handle “NA” in the columns for danceability, energy etc. These are audio features and only appear once for each title, instead of filling up the dataset with repeated numbers all over the place, it saves memory to only keep them in the dataset once, we can extract the audio features for each track into it’s own dataframe and name it “global_audiofeatures”.

global_audiofeats <- globalcharts %>%
  select(title, artist, energy:time_signature, total_streams, days, artist_days) %>%
  drop_na()

global_audiofeats %>% 
  head(10) %>% 
  kbl() %>%
  kable_material_dark()

title	artist	energy	key	loudness	mode	speechiness	acousticness	instrumentalness	liveness	valence	tempo	duration	time_signature	total_streams	days	artist_days
Last One Standing (feat. Polo G, Mozzy & Eminem)	Skylar Grey	0.692	1	-4.128	0	0.1270	0.23700	0.00e+00	0.0798	0.233	154.924	257.369s (~4.29 minutes)	4	1199762	1	1
fue mejor (feat. SZA)	Kali Uchis	0.384	5	-7.414	0	0.0353	0.23900	2.27e-04	0.0762	0.230	106.002	230.965s (~3.85 minutes)	4	736722	1	228
ONLY	LeeHi	0.296	5	-7.451	1	0.0346	0.89200	0.00e+00	0.0873	0.151	122.907	240.907s (~4.02 minutes)	3	2212553	3	3
Ya Supérame (En Vivo)	Grupo Firme	0.400	0	-6.521	1	0.1100	0.46300	0.00e+00	0.7600	0.568	75.576	189.05s (~3.15 minutes)	3	677450	1	1
LOCO	ITZY	0.886	1	-3.067	1	0.1770	0.01090	2.21e-05	0.3250	0.497	102.012	191.462s (~3.19 minutes)	4	6118531	6	97
A Maior Saudade - Ao Vivo	Henrique & Juliano	0.626	11	-4.780	1	0.0533	0.57100	0.00e+00	0.8700	0.305	162.824	190.162s (~3.17 minutes)	4	802947	1	134
La Sinvergüenza (feat. Banda MS de Sergio Lizárraga)	Christian Nodal	0.482	9	-5.352	1	0.0349	0.21100	0.00e+00	0.3000	0.604	102.010	198.529s (~3.31 minutes)	4	786876	1	21
Expectativa x Realidade	Matheus & Kauan	0.839	6	-5.674	1	0.0758	0.18200	0.00e+00	0.2000	0.568	123.793	163.324s (~2.72 minutes)	4	765597	1	1
My Universe	Coldplay, BTS	0.701	9	-6.390	1	0.0402	0.00813	0.00e+00	0.2000	0.443	104.988	228s (~3.8 minutes)	4	32790116	7	8
Your Heart	Joyner Lucas, J. Cole	0.637	6	-7.701	0	0.1330	0.39200	0.00e+00	0.0929	0.754	87.006	198.621s (~3.31 minutes)	4	7564368	7	7

Distributions

Let’s get back to the dataframe more related to streaming. “global_streaming”.

We would like to first examine the distributions of numeric variables in this dataset.

I like to use my own functions for repeated visuals.

histogram_plot <- function(df, plot_title, n_bins) {
  df %>%
    keep(is.numeric) %>%
    gather() %>% #creates two columns, key and value
    ggplot(aes(value)) +
    facet_wrap( ~ key, scales = "free", ncol = 3) +
    geom_histogram(bins = n_bins) +
    scale_x_continuous(labels = scales::comma) +
    scale_y_continuous(labels = scales::comma) +
    labs(title = plot_title) +
    chewyTheme()
}

Plotting all numeric variables relaated to streaming.

histogram_plot(global_streaming %>% select(-year, -rank), "Distributions for Global Top 200", 200)

Streams distribution

Plotting the most important variable ‘streams’, that is most likely to be used as the output (Y) variable.

global_streaming %>%
    select(streams) %>%
    keep(is.numeric) %>%
    gather() %>% #creates two columns, key and value
    ggplot(aes(value)) +
    facet_wrap( ~ key, scales = "free", ncol = 3) +
    geom_histogram(bins = 1000) +
    scale_x_continuous(labels = scales::comma, breaks = seq(0, 20e6, by = 4e6)) +
    scale_y_continuous(labels = scales::comma) +
    labs(title = "Distribution of streams in the Global Top 200 from 2017 to Sep' 2021") +
    chewyTheme()

The graph above is an important one, each bin in this graph represents a 1000 rows in our dataset. We can see that there’s a huge chunk of tracks that don’t reach 4 million streams in a day.

What we can infer, just by looking at this distribution, is that it resembles the Pareto distribution.

Before jumping in to any kind of modeling, there’s a lot of data exploring left to do. This, for me as an analyst is important, to fully grasp what data is being used and what it may or may not be used for (both ways can be equally important to consider).

For e.g, we know from the start that this dataset represents Spotify’s Top 200 Charts, which mean that even the lowest of streams in our dataset (300k streams), is considered extremely high compared to all the tracks streamed on Spotify on the same day. We are therefore dealing with a dataset that is originally part of a huge sample size.

Time Series

Let’s check how streams progresses over time, and also how the ranks are distributed as a third variable.

streams_timelineplot(global_streaming,
                     plot_title = "Daily streams from 2017 - Sep' 2021",
                     legendshow = FALSE)

Transforming data

One thing we can see, are the clear changes month to month and also more frequent changes. To inspect this further we would could analyze the daily rate of change for streams over the whole dataset. Afterwords we can choose specific cases.

The reasoning behind this is to view the Top 200 as it’s own “market”, analogue to the stock market. Where streams change daily, just as the prices of asset do. The “asset” would then be the track, we can also easily assume that artists reoccur on the Top 200.

Putting it all together we can see the track as an underlying asset to the artist, and thus there would be two “prices”, one that changes daily for the track asset, and one that changes daily for the artist asset.

Let’s call the entire dataframe for the Top 200 “market”: global_market

global_market %>%
  head(10) %>%
  kbl() %>%
  kable_material_dark()

date	market_streams	market_streams_text	perc_change	perc_text
2017-01-01	148613167	148.6m	NA	NA
2017-01-02	154810836	154.8m	0.0417034	4.17%
2017-01-03	166239930	166.2m	0.0738262	7.38%
2017-01-04	169252507	169.3m	0.0181219	1.81%
2017-01-05	169919094	169.9m	0.0039384	0.39%
2017-01-06	182321587	182.3m	0.0729906	7.30%
2017-01-07	177567024	177.6m	-0.0260779	-2.61%
2017-01-08	162831491	162.8m	-0.0829858	-8.30%
2017-01-09	175235859	175.2m	0.0761792	7.62%
2017-01-10	178480856	178.5m	0.0185179	1.85%

daily_change_linePlot(global_market, plot_title = "Daily change (%) in streams")

Let’s also build the dataframes for titles and artists.

artists_market <- df %>%
  select(artist, title, streams, date) %>%
  group_by(artist, date) %>%
  summarise(market_streams = sum(streams)) %>%
  mutate(market_streams_text = streamLabels(market_streams)) %>%
  ## daily change in streams for each artist
  mutate(perc_change = market_streams/lag(market_streams) - 1,
         perc_text = percent(perc_change, accuracy = 0.01)) %>%
  ungroup()

## `summarise()` has grouped output by 'artist'. You can override using the `.groups` argument.

title_market <- df %>%
  select(title, streams, date) %>%
  group_by(title, date) %>%
  summarise(market_streams = sum(streams)) %>%
  mutate(market_streams_text = streamLabels(market_streams)) %>%
  ## daily change in streams for each artist
  mutate(perc_change = market_streams/lag(market_streams) - 1,
         perc_text = percent(perc_change, accuracy = 0.01)) %>%
  ungroup()

## `summarise()` has grouped output by 'title'. You can override using the `.groups` argument.

artists_market %>%
  sample_n(10) %>%
  kbl() %>%
  kable_material_dark()

artist	date	market_streams	market_streams_text	perc_change	perc_text
Harry Styles	2021-07-28	2443148	2.4m	0.0144396	1.44%
Piso 21, Christian Nodal	2020-01-06	673783	0.7m	0.0688538	6.89%
Hailee Steinfeld, BloodPop®	2018-03-29	1133601	1.1m	-0.0389856	-3.90%
Megan Thee Stallion	2021-08-16	1065225	1.1m	0.0399633	4.00%
Marshmello, Khalid	2018-02-10	1501321	1.5m	-0.0789651	-7.90%
DJ Snake, J Balvin	2019-11-03	1212624	1.2m	-0.1554649	-15.55%
Rudimental, Major Lazer	2018-07-23	584506	0.6m	0.1727884	17.28%
Charlie Puth	2017-08-19	3249172	3.2m	-0.0456526	-4.57%
Jonas Blue, William Singe	2017-06-08	2205729	2.2m	0.0036849	0.37%
Pop Smoke	2020-11-12	10187585	10.2m	-0.0006810	-0.07%

Here we can see that for the artist “KAROL G”, they were down -3.65% in streams from the day before.

Artist case

From here on out we can choose any specific artist/track for further analysis.

Let’s say we would like to give Drake’s marketing team something to work with.

First, we can give them his “portfolio” in the Top 200.

drake_market <- artists_market %>%
  filter(artist == "Drake")

drake_market %>%
  head(10) %>%
  kbl() %>%
  kable_material_dark()

artist	date	market_streams	market_streams_text	perc_change	perc_text
Drake	2017-01-01	3284077	3.3m	NA	NA
Drake	2017-01-02	3329204	3.3m	0.0137412	1.37%
Drake	2017-01-03	3536864	3.5m	0.0623753	6.24%
Drake	2017-01-04	3568626	3.6m	0.0089803	0.90%
Drake	2017-01-05	3596688	3.6m	0.0078635	0.79%
Drake	2017-01-06	3657554	3.7m	0.0169228	1.69%
Drake	2017-01-07	3558220	3.6m	-0.0271586	-2.72%
Drake	2017-01-08	3324323	3.3m	-0.0657343	-6.57%
Drake	2017-01-09	3462467	3.5m	0.0415555	4.16%
Drake	2017-01-10	3517026	3.5m	0.0157573	1.58%

Over a 10 day period, Drake had two days where he lost streams, they were also consecutive. Insights: Another artist, comparative to Drake, may have out-rivaled him those two days by dropping a single or album.

Giving them Drake’s daily change over time shows immense virality in his latest release.

daily_change_linePlot(drake_market, 
                      plot_title = "Drake's daily change (%) in streams") + 
  daily_change_linePlot(drake_market %>% filter(perc_change < 1),
                        plot_title = "Drake's daily change (%) in streams < 100% daily change")

Say we want to show them the differnces in four of his biggest hit songs. A function I’ve written for this dataset will extract Drake’s top streamed tracks over the 3 year period.

topkTitle_totalstreams(global_streaming,
                       sel_artist = "Drake",
                       k = 4)

## # A tibble: 4 × 13
##   title      total_streams min_streams max_streams avg_streams min_rank max_rank
##   <fct>              <dbl>       <dbl>       <dbl>       <dbl>    <dbl>    <dbl>
## 1 God's Plan    1419636888      561125     8553009     1979968        1      200
## 2 In My Fee…     844174865      406279     9847333     2300204        1      200
## 3 Toosie Sl…     676623030      670495     6574812     2075531        1      200
## 4 Nice For …     589260626      579247     6443974     1977385        1      200
## # … with 6 more variables: avg_rank <dbl>, streamingrank <int>, year <dbl>,
## #   perc <dbl>, perc_text <chr>, streams_text <chr>

Let’s pick these four to compare their rates over time.

daily_change_linePlot(title_market %>% filter(title == "God's Plan"), "God's Plan") +
daily_change_linePlot(title_market %>% filter(title == "In My Feelings"), "In My Feelings") +
daily_change_linePlot(title_market %>% filter(title == "Toosie Slide"), "Toosie Slide") +
daily_change_linePlot(title_market %>% filter(title == "Nice For What"), "Nice For What")

We can also visualize his streams over time, show them his best days, and how he trends over specific ranks, as shown below, I have set a threshold of ranking 3 that are shown in color.

titlesartist_lineplot(df,
                      sel_artist = "Drake",
                      n_ranks = 3,
                      legendshow = TRUE)

We can compare Drake with somebody who also dominates the charts.

titlesartist_lineplot(df,
                      sel_artist = "Drake",
                      n_ranks = 1,
                      legendshow = TRUE) /
titlesartist_lineplot(df,
                      sel_artist = "Ed Sheeran",
                      n_ranks = 1,
                      legendshow = FALSE) /
titlesartist_lineplot(df,
                      sel_artist = "The Weeknd",
                      n_ranks = 1,
                      legendshow = FALSE)

We can also show them his streams over time categorized by months, quarters or years.

topyearsbarplot(topkYear_totalstreams(global_streaming,
                      sel_artist = "Drake"), 
                sel_artist = "Drake")

topquartersbarplot(topkQuarter_totalstreams(global_streaming,
                      sel_artist = "Drake"), 
                sel_artist = "Drake")

topmonthsbarplot(topkMonth_totalstreams(global_streaming,
                      sel_artist = "Drake"), 
                sel_artist = "Drake")

Case insights

Quarterly and monthly trends
Accumulating streams between releases
Release schedule in comparison to other artists
Diving into audio features and possible correlations
Predicting streams for upcoming releases using predictive analytics
Classifying Drake’s catalogue (by audio features, trends, rankings, periods…)
Using financial modeling to figure out how to “grow his portfolio” over time.

This is all assuming Drake’s team want to increase his size on the charts.

Further discussion

The modeling part is not expanded upon in this project as it would require diving into statistics.
The main part about the modeling is to either go the way of financial modeling, viewing the artist and titles as assets in a portfolio, and modeling thereafter. Or using models such as kNN, OLS or PLS.

*In reality: depending on what data is available and data engineering limits.

This product can be used by artists, managers and record labels to gain insights about how their music is performing on the charts.
Further expansion of the product includes sourcing data from other relevant playlists such as their Viral 50 or “RapCaviar”. But also including Apple Music, Tidal and then possibly connecting to various metrics we can get from Instagram / TikTok.

This project is not for commercial use as it’s intended for presentational purposes only. Any IP created within this notebook belongs to Kareem Elgindy under legal rights to Rising Sun AB}

Spotify Charts Analysis

Kareem Elgindy

1/22/2022