Introduction

This presentation will be for market research agency NEPA, where I will show case an approach to this data. My aim is to communicate what usage this data might have and for who, what tools I will use to navigate myself in this territory and to bring ideas on how to further expand upon the product.

One part of this data is collected from Spotify’s Top 200 charts and include variables such as artist, track title, date, daily streams and daily ranking in the Top 200.

The other part is fetched through Spotify’s Web API, where access to further variables about the artist and track are available, such as tempo, key of the song, speechiness, instrumentalness, etc.

Setup

First we’ll load the libraries needed for this project.

library(tidyverse) ## data viz, data wrangling, etc.
library(tidymodels) ## data modeling
library(zoo) ## rolling averages
library(spotifyr) ## Spotify web API R wrapper
library(kableExtra)

Load plot functions, ggplot2 theming and other libraries from script.

source("appPlotfunctions.R")

Exploratory data analysis (EDA)

Importing data

This data is already tidied data and wrangled from two other scripts. One is the “API-import-wrangle.Rmd”, where I’ve utilised Spotify’s Web API to connect a dataset of Spotify’s Top 200 that was scraped from spotifycharts.com, ranging from 2017-2021 with daily streams and rankings.

Together they create this dataset that I have saved specifically for my thesis project and, will be used for this notebook presentation aswell.

(This has already been csv-imported through “appPlotfunctions.R”)

Inspecting & cleaning the data

As we can see below, this dataset contains 42 columns, so although it is technically tidy, it is not yet useable for any purpose at first glance.

head(globalcharts)
## # A tibble: 6 × 42
##   title   artist  streams  rank date        year quarter month trend min_streams
##   <chr>   <chr>     <dbl> <dbl> <date>     <dbl> <chr>   <chr> <chr>       <dbl>
## 1 STAY (… The Ki… 7714466     1 2021-09-30  2021 2021/Q3 2021… SAME…     5240809
## 2 INDUST… Lil Na… 6517968     2 2021-09-30  2021 2021/Q3 2021… SAME…     4331516
## 3 Heat W… Glass … 4460880     3 2021-09-30  2021 2021/Q3 2021… SAME…      730942
## 4 My Uni… Coldpl… 4142687     4 2021-09-30  2021 2021/Q3 2021… MOVE…     4013065
## 5 Bad Ha… Ed She… 4077321     5 2021-09-30  2021 2021/Q3 2021… MOVE…     3777161
## 6 Pepas   Farruko 3982650     6 2021-09-30  2021 2021/Q3 2021… SAME…      693354
## # … with 32 more variables: max_streams <dbl>, avg_streams <dbl>,
## #   median_streams <dbl>, total_streams <dbl>, min_rank <dbl>, max_rank <dbl>,
## #   avg_rank <dbl>, median_rank <dbl>, days <int>, artist_days <int>,
## #   streamstitle_3dMA <dbl>, streamstitle_7dMA <dbl>, streamstitle_14dMA <dbl>,
## #   streamstitle_30dMA <dbl>, streams_7dMA <dbl>, streams_14dMA <dbl>,
## #   streams_30dMA <dbl>, streams_90dMA <dbl>, danceability <dbl>, energy <dbl>,
## #   key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>, …

First we would like to focus on the variables related to streaming, ‘title’, ‘artist’, ‘streams’, ‘rank’, ‘trend’, streams rolling averages (by rank) and any metrics related to streams or time (dates).

Let’s go ahead and drop the rest of the columns and name our new dataframe “global_streaming”.

global_streaming <- globalcharts %>%
  select(title:artist_days, streams_7dMA, streams_14dMA) %>%
  rename(streamsrank_7dMA = streams_7dMA,
         streamsrank_14dMA = streams_14dMA)
global_streaming %>% 
  head(10) %>% 
  kbl() %>%
  kable_material_dark()
title artist streams rank date year quarter month trend min_streams max_streams avg_streams median_streams total_streams min_rank max_rank avg_rank median_rank days artist_days streamsrank_7dMA streamsrank_14dMA
STAY (with Justin Bieber) The Kid LAROI 7714466 1 2021-09-30 2021 2021/Q3 2021/9 SAME_POSITION 5240809 10629302 8388760 8716512 704655831 1 4 1 1 84 741 NA NA
INDUSTRY BABY (feat. Jack Harlow) Lil Nas X 6517968 2 2021-09-30 2021 2021/Q3 2021/9 SAME_POSITION 4331516 8808346 6089915 6113264 426294063 2 13 3 2 70 1221 NA NA
Heat Waves Glass Animals 4460880 3 2021-09-30 2021 2021/Q3 2021/9 SAME_POSITION 730942 4784611 1938349 1774850 552429333 3 200 43 34 285 287 NA NA
My Universe Coldplay, BTS 4142687 4 2021-09-30 2021 2021/Q3 2021/9 MOVE_UP 4013065 6768788 4684302 4219891 32790116 3 5 4 3 7 8 NA NA
Bad Habits Ed Sheeran 4077321 5 2021-09-30 2021 2021/Q3 2021/9 MOVE_DOWN 3777161 6008275 5006624 5117893 490649138 3 19 4 4 98 9860 NA NA
Pepas Farruko 3982650 6 2021-09-30 2021 2021/Q3 2021/9 SAME_POSITION 693354 4991996 3413331 3921750 276479845 3 189 20 8 81 343 NA NA
Woman Doja Cat 3905977 7 2021-09-30 2021 2021/Q3 2021/9 SAME_POSITION 405110 4541274 2571863 3017278 234039535 5 200 53 19 91 1675 NA NA
Shivers Ed Sheeran 3706870 8 2021-09-30 2021 2021/Q3 2021/9 SAME_POSITION 2755593 4465660 3661475 3697536 76890977 8 13 9 9 21 9860 NA NA
THATS WHAT I WANT Lil Nas X 3529010 9 2021-09-30 2021 2021/Q3 2021/9 SAME_POSITION 3333903 5208116 4170450 4247074 58386295 3 9 6 6 14 1221 NA NA
Beggin’ Måneskin 3367619 10 2021-09-30 2021 2021/Q3 2021/9 SAME_POSITION 721300 8005228 4916102 4978832 599764421 1 195 14 4 122 377 NA NA

Second, we also want the handle “NA” in the columns for danceability, energy etc. These are audio features and only appear once for each title, instead of filling up the dataset with repeated numbers all over the place, it saves memory to only keep them in the dataset once, we can extract the audio features for each track into it’s own dataframe and name it “global_audiofeatures”.

global_audiofeats <- globalcharts %>%
  select(title, artist, energy:time_signature, total_streams, days, artist_days) %>%
  drop_na()
global_audiofeats %>% 
  head(10) %>% 
  kbl() %>%
  kable_material_dark()
title artist energy key loudness mode speechiness acousticness instrumentalness liveness valence tempo duration time_signature total_streams days artist_days
Last One Standing (feat. Polo G, Mozzy & Eminem) Skylar Grey 0.692 1 -4.128 0 0.1270 0.23700 0.00e+00 0.0798 0.233 154.924 257.369s (~4.29 minutes) 4 1199762 1 1
fue mejor (feat. SZA) Kali Uchis 0.384 5 -7.414 0 0.0353 0.23900 2.27e-04 0.0762 0.230 106.002 230.965s (~3.85 minutes) 4 736722 1 228
ONLY LeeHi 0.296 5 -7.451 1 0.0346 0.89200 0.00e+00 0.0873 0.151 122.907 240.907s (~4.02 minutes) 3 2212553 3 3
Ya Supérame (En Vivo) Grupo Firme 0.400 0 -6.521 1 0.1100 0.46300 0.00e+00 0.7600 0.568 75.576 189.05s (~3.15 minutes) 3 677450 1 1
LOCO ITZY 0.886 1 -3.067 1 0.1770 0.01090 2.21e-05 0.3250 0.497 102.012 191.462s (~3.19 minutes) 4 6118531 6 97
A Maior Saudade - Ao Vivo Henrique & Juliano 0.626 11 -4.780 1 0.0533 0.57100 0.00e+00 0.8700 0.305 162.824 190.162s (~3.17 minutes) 4 802947 1 134
La Sinvergüenza (feat. Banda MS de Sergio Lizárraga) Christian Nodal 0.482 9 -5.352 1 0.0349 0.21100 0.00e+00 0.3000 0.604 102.010 198.529s (~3.31 minutes) 4 786876 1 21
Expectativa x Realidade Matheus & Kauan 0.839 6 -5.674 1 0.0758 0.18200 0.00e+00 0.2000 0.568 123.793 163.324s (~2.72 minutes) 4 765597 1 1
My Universe Coldplay, BTS 0.701 9 -6.390 1 0.0402 0.00813 0.00e+00 0.2000 0.443 104.988 228s (~3.8 minutes) 4 32790116 7 8
Your Heart Joyner Lucas, J. Cole 0.637 6 -7.701 0 0.1330 0.39200 0.00e+00 0.0929 0.754 87.006 198.621s (~3.31 minutes) 4 7564368 7 7

Distributions

Let’s get back to the dataframe more related to streaming. “global_streaming”.

We would like to first examine the distributions of numeric variables in this dataset.

I like to use my own functions for repeated visuals.

histogram_plot <- function(df, plot_title, n_bins) {
  df %>%
    keep(is.numeric) %>%
    gather() %>% #creates two columns, key and value
    ggplot(aes(value)) +
    facet_wrap( ~ key, scales = "free", ncol = 3) +
    geom_histogram(bins = n_bins) +
    scale_x_continuous(labels = scales::comma) +
    scale_y_continuous(labels = scales::comma) +
    labs(title = plot_title) +
    chewyTheme()
} 

Plotting all numeric variables relaated to streaming.

histogram_plot(global_streaming %>% select(-year, -rank), "Distributions for Global Top 200", 200)

Streams distribution

Plotting the most important variable ‘streams’, that is most likely to be used as the output (Y) variable.

global_streaming %>%
    select(streams) %>%
    keep(is.numeric) %>%
    gather() %>% #creates two columns, key and value
    ggplot(aes(value)) +
    facet_wrap( ~ key, scales = "free", ncol = 3) +
    geom_histogram(bins = 1000) +
    scale_x_continuous(labels = scales::comma, breaks = seq(0, 20e6, by = 4e6)) +
    scale_y_continuous(labels = scales::comma) +
    labs(title = "Distribution of streams in the Global Top 200 from 2017 to Sep' 2021") +
    chewyTheme()

The graph above is an important one, each bin in this graph represents a 1000 rows in our dataset. We can see that there’s a huge chunk of tracks that don’t reach 4 million streams in a day.

What we can infer, just by looking at this distribution, is that it resembles the Pareto distribution.

Before jumping in to any kind of modeling, there’s a lot of data exploring left to do. This, for me as an analyst is important, to fully grasp what data is being used and what it may or may not be used for (both ways can be equally important to consider).

For e.g, we know from the start that this dataset represents Spotify’s Top 200 Charts, which mean that even the lowest of streams in our dataset (300k streams), is considered extremely high compared to all the tracks streamed on Spotify on the same day. We are therefore dealing with a dataset that is originally part of a huge sample size.

Time Series

Let’s check how streams progresses over time, and also how the ranks are distributed as a third variable.

streams_timelineplot(global_streaming,
                     plot_title = "Daily streams from 2017 - Sep' 2021",
                     legendshow = FALSE)

Transforming data

One thing we can see, are the clear changes month to month and also more frequent changes. To inspect this further we would could analyze the daily rate of change for streams over the whole dataset. Afterwords we can choose specific cases.

The reasoning behind this is to view the Top 200 as it’s own “market”, analogue to the stock market. Where streams change daily, just as the prices of asset do. The “asset” would then be the track, we can also easily assume that artists reoccur on the Top 200.

Putting it all together we can see the track as an underlying asset to the artist, and thus there would be two “prices”, one that changes daily for the track asset, and one that changes daily for the artist asset.

Let’s call the entire dataframe for the Top 200 “market”: global_market

global_market %>%
  head(10) %>%
  kbl() %>%
  kable_material_dark()
date market_streams market_streams_text perc_change perc_text
2017-01-01 148613167 148.6m NA NA
2017-01-02 154810836 154.8m 0.0417034 4.17%
2017-01-03 166239930 166.2m 0.0738262 7.38%
2017-01-04 169252507 169.3m 0.0181219 1.81%
2017-01-05 169919094 169.9m 0.0039384 0.39%
2017-01-06 182321587 182.3m 0.0729906 7.30%
2017-01-07 177567024 177.6m -0.0260779 -2.61%
2017-01-08 162831491 162.8m -0.0829858 -8.30%
2017-01-09 175235859 175.2m 0.0761792 7.62%
2017-01-10 178480856 178.5m 0.0185179 1.85%
daily_change_linePlot(global_market, plot_title = "Daily change (%) in streams")

Let’s also build the dataframes for titles and artists.

artists_market <- df %>%
  select(artist, title, streams, date) %>%
  group_by(artist, date) %>%
  summarise(market_streams = sum(streams)) %>%
  mutate(market_streams_text = streamLabels(market_streams)) %>%
  ## daily change in streams for each artist
  mutate(perc_change = market_streams/lag(market_streams) - 1,
         perc_text = percent(perc_change, accuracy = 0.01)) %>%
  ungroup()
## `summarise()` has grouped output by 'artist'. You can override using the `.groups` argument.
title_market <- df %>%
  select(title, streams, date) %>%
  group_by(title, date) %>%
  summarise(market_streams = sum(streams)) %>%
  mutate(market_streams_text = streamLabels(market_streams)) %>%
  ## daily change in streams for each artist
  mutate(perc_change = market_streams/lag(market_streams) - 1,
         perc_text = percent(perc_change, accuracy = 0.01)) %>%
  ungroup()
## `summarise()` has grouped output by 'title'. You can override using the `.groups` argument.
artists_market %>%
  sample_n(10) %>%
  kbl() %>%
  kable_material_dark()
artist date market_streams market_streams_text perc_change perc_text
Harry Styles 2021-07-28 2443148 2.4m 0.0144396 1.44%
Piso 21, Christian Nodal 2020-01-06 673783 0.7m 0.0688538 6.89%
Hailee Steinfeld, BloodPop® 2018-03-29 1133601 1.1m -0.0389856 -3.90%
Megan Thee Stallion 2021-08-16 1065225 1.1m 0.0399633 4.00%
Marshmello, Khalid 2018-02-10 1501321 1.5m -0.0789651 -7.90%
DJ Snake, J Balvin 2019-11-03 1212624 1.2m -0.1554649 -15.55%
Rudimental, Major Lazer 2018-07-23 584506 0.6m 0.1727884 17.28%
Charlie Puth 2017-08-19 3249172 3.2m -0.0456526 -4.57%
Jonas Blue, William Singe 2017-06-08 2205729 2.2m 0.0036849 0.37%
Pop Smoke 2020-11-12 10187585 10.2m -0.0006810 -0.07%

Here we can see that for the artist “KAROL G”, they were down -3.65% in streams from the day before.

Artist case

From here on out we can choose any specific artist/track for further analysis.

Let’s say we would like to give Drake’s marketing team something to work with.

First, we can give them his “portfolio” in the Top 200.

drake_market <- artists_market %>%
  filter(artist == "Drake")

drake_market %>%
  head(10) %>%
  kbl() %>%
  kable_material_dark()
artist date market_streams market_streams_text perc_change perc_text
Drake 2017-01-01 3284077 3.3m NA NA
Drake 2017-01-02 3329204 3.3m 0.0137412 1.37%
Drake 2017-01-03 3536864 3.5m 0.0623753 6.24%
Drake 2017-01-04 3568626 3.6m 0.0089803 0.90%
Drake 2017-01-05 3596688 3.6m 0.0078635 0.79%
Drake 2017-01-06 3657554 3.7m 0.0169228 1.69%
Drake 2017-01-07 3558220 3.6m -0.0271586 -2.72%
Drake 2017-01-08 3324323 3.3m -0.0657343 -6.57%
Drake 2017-01-09 3462467 3.5m 0.0415555 4.16%
Drake 2017-01-10 3517026 3.5m 0.0157573 1.58%

Over a 10 day period, Drake had two days where he lost streams, they were also consecutive. Insights: Another artist, comparative to Drake, may have out-rivaled him those two days by dropping a single or album.

Giving them Drake’s daily change over time shows immense virality in his latest release.

daily_change_linePlot(drake_market, 
                      plot_title = "Drake's daily change (%) in streams") + 
  daily_change_linePlot(drake_market %>% filter(perc_change < 1),
                        plot_title = "Drake's daily change (%) in streams < 100% daily change")

Say we want to show them the differnces in four of his biggest hit songs. A function I’ve written for this dataset will extract Drake’s top streamed tracks over the 3 year period.

topkTitle_totalstreams(global_streaming,
                       sel_artist = "Drake",
                       k = 4)
## # A tibble: 4 × 13
##   title      total_streams min_streams max_streams avg_streams min_rank max_rank
##   <fct>              <dbl>       <dbl>       <dbl>       <dbl>    <dbl>    <dbl>
## 1 God's Plan    1419636888      561125     8553009     1979968        1      200
## 2 In My Fee…     844174865      406279     9847333     2300204        1      200
## 3 Toosie Sl…     676623030      670495     6574812     2075531        1      200
## 4 Nice For …     589260626      579247     6443974     1977385        1      200
## # … with 6 more variables: avg_rank <dbl>, streamingrank <int>, year <dbl>,
## #   perc <dbl>, perc_text <chr>, streams_text <chr>

Let’s pick these four to compare their rates over time.

daily_change_linePlot(title_market %>% filter(title == "God's Plan"), "God's Plan") +
daily_change_linePlot(title_market %>% filter(title == "In My Feelings"), "In My Feelings") +
daily_change_linePlot(title_market %>% filter(title == "Toosie Slide"), "Toosie Slide") +
daily_change_linePlot(title_market %>% filter(title == "Nice For What"), "Nice For What")

We can also visualize his streams over time, show them his best days, and how he trends over specific ranks, as shown below, I have set a threshold of ranking 3 that are shown in color.

titlesartist_lineplot(df,
                      sel_artist = "Drake",
                      n_ranks = 3,
                      legendshow = TRUE)

We can compare Drake with somebody who also dominates the charts.

titlesartist_lineplot(df,
                      sel_artist = "Drake",
                      n_ranks = 1,
                      legendshow = TRUE) /
titlesartist_lineplot(df,
                      sel_artist = "Ed Sheeran",
                      n_ranks = 1,
                      legendshow = FALSE) /
titlesartist_lineplot(df,
                      sel_artist = "The Weeknd",
                      n_ranks = 1,
                      legendshow = FALSE)  

We can also show them his streams over time categorized by months, quarters or years.

topyearsbarplot(topkYear_totalstreams(global_streaming,
                      sel_artist = "Drake"), 
                sel_artist = "Drake")

topquartersbarplot(topkQuarter_totalstreams(global_streaming,
                      sel_artist = "Drake"), 
                sel_artist = "Drake")

topmonthsbarplot(topkMonth_totalstreams(global_streaming,
                      sel_artist = "Drake"), 
                sel_artist = "Drake")

Case insights

  • Quarterly and monthly trends
  • Accumulating streams between releases
  • Release schedule in comparison to other artists
  • Diving into audio features and possible correlations
  • Predicting streams for upcoming releases using predictive analytics
  • Classifying Drake’s catalogue (by audio features, trends, rankings, periods…)
  • Using financial modeling to figure out how to “grow his portfolio” over time.

This is all assuming Drake’s team want to increase his size on the charts.

Further discussion

*In reality: depending on what data is available and data engineering limits.

This project is not for commercial use as it’s intended for presentational purposes only. Any IP created within this notebook belongs to Kareem Elgindy under legal rights to Rising Sun AB}