Spotify music

Spotify Popularity from 2010 - 2019

This dataset was accessed via DataCamp’s server. While DataCamp’s built-in Workspace supports R, I decided to take advantage of RStudio’s ability to create R Markdown files instead of using a Jupyter Notebook, just for the sake of it.

The main impetus behind exploring this particular dataset comes from my background as a musician and as the CEO of a hi-fi audio company. Even though I am a classically trained violinist, one thing all genres of music share is the ability to move, connect and communicate without borders. Ysaye’s Sonatas and, dare I say, Backstreet Boys all have the capability of making the listener feel.

For these reasons, the tone of this study clearly leans towards the informal side; specifically, I’ll be documenting my thought processes and approach in a very colloquial manner.

Despite the levity of the subject matter and my tone, we will still treat the analysis itself seriously, and our process remains unchanged.

The main questions I seek to answer from the dataset are as follows:

1) Which genres are the most popular from the last decade?

H0: No genre is any more popular than any other.

HA: At least one genre is more popular than the others.

2) Does bpm correlate with danceability?

H0: No, there is no correlation between bpm and danceability.

HA: Yes, there is a correlation between bpm and danceability.

3) Does bpm correlate with popularity?

H0: No, there is no correlation between bpm and popularity.

HA: Yes, there is a correlation between bpm and popularity.

4) Is there a particular duration of song that is the most popular?

H0: No, there is no duration of song that is the most popular.

HA: Yes, there is an ideal song duration for popularity.

First we’ll need to load the data. We’re going to use the readr library for its speed, then quickly examine the first 10 rows and the overall shape of the data frame.

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(RColorBrewer)
library(wesanderson)
spotify <- read_csv("spotify.csv")

## Warning: One or more parsing issues, see `problems()` for details

## Rows: 549 Columns: 14

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): title, artist, top genre, year
## dbl (9): bpm, nrgy, dnce, dB, live, val, dur, acous, spch
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(spotify, n=10)

## # A tibble: 10 × 14
##    title    artist top g…¹ year    bpm  nrgy  dnce    dB  live   val   dur acous
##    <chr>    <chr>  <chr>   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 "\"Hey,… Train  neo me… 2010     97    89    67    -4     8    80   217    19
##  2 "Love T… Eminem detroi… 2010     87    93    75    -5    52    64   263    24
##  3 "TiK To… Kesha  dance … 2010    120    84    76    -3    29    71   200    10
##  4 "Bad Ro… Lady … dance … 2010    119    92    70    -4     8    71   295     0
##  5 "Just t… Bruno… pop     2010    109    84    64    -5     9    43   221     2
##  6 "Baby"   Justi… canadi… 2010     65    86    73    -5    11    54   214     4
##  7 "Dynami… Taio … dance … 2010    120    78    75    -4     4    82   203     0
##  8 "Secret… OneRe… dance … 2010    148    76    52    -6    12    38   225     7
##  9 "Empire… Alici… hip pop 2010     93    37    48    -8    12    14   216    74
## 10 "Only G… Rihan… barbad… 2010    126    72    79    -4     7    61   235    13
## # … with 2 more variables: spch <dbl>, pop <dbl>, and abbreviated variable name
## #   ¹`top genre`
## # ℹ Use `colnames()` to see all variable names

cat(dim(spotify))

## 549 14

It seems there are only 549 rows across 14 columns, which is less data than I thought but still a decent amount to work with.

The data is in tidy format, with each row pertaining to one song, and each column pertaining to an aspect of that song (e.g. title, artist, year, etc.).
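One thing worth a second look: the column specification lists year as chr rather than dbl, and read_csv warned about parsing issues, which may mean at least one row holds a non-numeric value in that column. Here is a minimal sketch of how one might locate such rows. The toy data frame and its values are made up for illustration and are not taken from spotify.csv.

```r
# Toy data standing in for the real file; all values are hypothetical
toy <- data.frame(
  title = c("Song A", "Song B", "Song C"),
  year  = c("2010", "dance pop", "2012"),
  stringsAsFactors = FALSE
)

# as.integer() turns anything that is not a number into NA (with a warning)
year_num <- suppressWarnings(as.integer(toy$year))

# Rows whose year could not be converted are the suspicious ones
bad_rows <- toy[is.na(year_num), ]
print(bad_rows)
```

On the real data, the same check could be run on spotify$year, alongside the `problems()` call that the readr warning itself suggests.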

Let’s look at each column in detail.

colnames(spotify)

##  [1] "title"     "artist"    "top genre" "year"      "bpm"       "nrgy"     
##  [7] "dnce"      "dB"        "live"      "val"       "dur"       "acous"    
## [13] "spch"      "pop"

There are 14 columns. They’re broken down as follows:

title - Title of the song
artist - Artist of the song
top genre - The genre the song falls under
year - The year the song was released
bpm - Beats per minute, which is a standard measure of tempo (speed) in a song
nrgy - Energy, a subjective measure of a song (higher is faster/louder)
dnce - Spotify’s measure of how danceable a song is (higher is more, lower is less)
dB - The actual loudness of the song (measured in decibels)
live - Liveness, the likelihood the song was recorded live
val - Valence, how happy a song is (higher is happy, lower is sad)
dur - The duration of the song (in seconds)
acous - The likelihood the song is acoustic
spch - How much of the song is spoken word (higher is more spoken words)
pop - Spotify’s popularity measure of a song (higher values are more popular)

In order to answer the questions we have, we’d like the following columns (n=10): [‘title’, ‘artist’, ‘top genre’, ‘year’, ‘bpm’, ‘dnce’, ‘dB’, ‘val’, ‘dur’, ‘pop’]

Let’s slice the data frame down to just the columns we need. It’s important to note here that R uses one-based indexing, so the first column is 1 and not 0, and so on.

spotify_slice <- spotify[,c(1,2,3,4,5,7,8,10,11,14)]
print(spotify_slice)

## # A tibble: 549 × 10
##    title                artist top g…¹ year    bpm  dnce    dB   val   dur   pop
##    <chr>                <chr>  <chr>   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 "\"Hey, Soul Sister… Train  neo me… 2010     97    67    -4    80   217    83
##  2 "Love The Way You L… Eminem detroi… 2010     87    75    -5    64   263    82
##  3 "TiK ToK"            Kesha  dance … 2010    120    76    -3    71   200    80
##  4 "Bad Romance"        Lady … dance … 2010    119    70    -4    71   295    79
##  5 "Just the Way You A… Bruno… pop     2010    109    64    -5    43   221    78
##  6 "Baby"               Justi… canadi… 2010     65    73    -5    54   214    77
##  7 "Dynamite"           Taio … dance … 2010    120    75    -4    82   203    77
##  8 "Secrets"            OneRe… dance … 2010    148    52    -6    38   225    77
##  9 "Empire State of Mi… Alici… hip pop 2010     93    48    -8    14   216    76
## 10 "Only Girl (In The … Rihan… barbad… 2010    126    79    -4    61   235    73
## # … with 539 more rows, and abbreviated variable name ¹`top genre`
## # ℹ Use `print(n = ...)` to see more rows
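Selecting columns by position works, but it silently picks the wrong columns if the file's column order ever changes. As a hedged alternative, the sketch below selects by name instead. The toy data frame and its values are hypothetical; only the column names mirror the real ones.

```r
# Toy data frame with hypothetical values, mimicking a few of the real column names
toy <- data.frame(
  title = c("Song A", "Song B"),
  nrgy  = c(80, 60),
  bpm   = c(120, 95),
  pop   = c(70, 55),
  stringsAsFactors = FALSE
)

# Selecting by name keeps working even if the column order changes
keep <- c("title", "bpm", "pop")
toy_slice <- toy[, keep]
print(colnames(toy_slice))
```

For the real data, the equivalent would be a character vector of the ten names listed earlier, including "top genre" with its space.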

It seems like the data is mostly clean (at least based on the limited view we’ve seen so far). Let’s check whether there are duplicates in the title column and whether missing values exist anywhere.

spotify[duplicated(spotify_slice[,1]),]

## # A tibble: 13 × 14
##    title    artist top g…¹ year    bpm  nrgy  dnce    dB  live   val   dur acous
##    <chr>    <chr>  <chr>   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Just th… Bruno… pop     2011    109    84    64    -5     9    43   221     2
##  2 Marry Y… Bruno… pop     2011    145    83    62    -5    10    48   230    33
##  3 Written… Tinie… dance … 2011     91    95    64    -4    18    57   220     6
##  4 Castle … T.I.   atl hi… 2011     80    86    45    -5    26    58   329     7
##  5 We Are … Taylo… pop     2013     86    68    63    -6    12    75   193     1
##  6 A Littl… Fergie dance … 2014    130    62    76    -6     9    52   241     1
##  7 Company  Justi… canadi… 2016     95    80    59    -5     8    43   208    13
##  8 Runnin'… Naugh… tropic… 2016    139    85    32    -6    48     8   213     1
##  9 Here     Aless… canadi… 2016    120    82    38    -4     8    33   199     8
## 10 All I A… Adele  britis… 2017    142    28    59    -5    15    34   272    88
## 11 I Like … Cardi… pop     2018    136    73    82    -4    37    65   253    10
## 12 First T… Kygo   edm     2018     90    59    63    -7    10    68   194    20
## 13 Kissing… DNCE   dance … 2018    120    74    77    -6     9    86   202     5
## # … with 2 more variables: spch <dbl>, pop <dbl>, and abbreviated variable name
## #   ¹`top genre`
## # ℹ Use `colnames()` to see all variable names

Well well, looks like there are a few duplicate titles (13). Let’s remove these by creating a new data frame without them.

spotify_slice_no_dup <- spotify_slice[!duplicated(spotify_slice[,1]),]
print(spotify_slice_no_dup)

## # A tibble: 536 × 10
##    title                artist top g…¹ year    bpm  dnce    dB   val   dur   pop
##    <chr>                <chr>  <chr>   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 "\"Hey, Soul Sister… Train  neo me… 2010     97    67    -4    80   217    83
##  2 "Love The Way You L… Eminem detroi… 2010     87    75    -5    64   263    82
##  3 "TiK ToK"            Kesha  dance … 2010    120    76    -3    71   200    80
##  4 "Bad Romance"        Lady … dance … 2010    119    70    -4    71   295    79
##  5 "Just the Way You A… Bruno… pop     2010    109    64    -5    43   221    78
##  6 "Baby"               Justi… canadi… 2010     65    73    -5    54   214    77
##  7 "Dynamite"           Taio … dance … 2010    120    75    -4    82   203    77
##  8 "Secrets"            OneRe… dance … 2010    148    52    -6    38   225    77
##  9 "Empire State of Mi… Alici… hip pop 2010     93    48    -8    14   216    76
## 10 "Only Girl (In The … Rihan… barbad… 2010    126    79    -4    61   235    73
## # … with 526 more rows, and abbreviated variable name ¹`top genre`
## # ℹ Use `print(n = ...)` to see more rows
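One caveat about de-duplicating on title alone: two different artists can release songs with the same title, and those would be dropped too. A more conservative sketch, using made-up rows (the artists and years are hypothetical), checks title and artist together:

```r
# Toy rows; titles are reused on purpose, artists and years are hypothetical
toy <- data.frame(
  title  = c("Here", "Here", "Baby", "Baby"),
  artist = c("Artist X", "Artist Y", "Artist Z", "Artist Z"),
  year   = c(2015, 2016, 2010, 2011),
  stringsAsFactors = FALSE
)

# Title alone flags rows 2 and 4, although row 2 is by a different artist
dup_title <- duplicated(toy$title)

# Title and artist together flag only row 4, the genuine repeat
dup_both <- duplicated(toy[, c("title", "artist")])

toy_dedup <- toy[!dup_both, ]
print(toy_dedup)
```

Whether this matters for the 13 rows found here depends on the data; comparing the two logical vectors on the real data frame would show it.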

We now have 536 rows. Let’s check to see if there are any missing (NA) values.

summary(is.na(spotify_slice_no_dup))

##    title           artist        top genre          year        
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:536       FALSE:536       FALSE:536       FALSE:536      
##     bpm             dnce             dB             val         
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:536       FALSE:536       FALSE:536       FALSE:536      
##     dur             pop         
##  Mode :logical   Mode :logical  
##  FALSE:536       FALSE:536

Running summary on is.na() shows that no column has any missing values. That makes life a little easier.
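For a more compact view of the same check, counting the missing values per column gives one number each. This is a small sketch on a hypothetical toy data frame that contains missing values on purpose:

```r
# Toy data with deliberate gaps; values are hypothetical
toy <- data.frame(
  bpm = c(120, NA, 95),
  pop = c(70, 55, NA),
  dur = c(210, 180, 240)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only the rows with no missing values at all
complete <- toy[complete.cases(toy), ]
print(complete)
```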

Let’s now try to answer the first question:

1) Which genres are the most popular from the last decade?

In order to do so, we will count how many songs fall under each genre, sort those counts, and keep the ten most frequent genres.

spotify_top_genres <- table(spotify_slice_no_dup['top genre'])
spotify_top_genres_sorted <- sort(spotify_top_genres, decreasing = TRUE)
spotify_top_genres_sorted_abridged <- tail(sort(spotify_top_genres_sorted, decreasing = FALSE),10)

Wow, it looks like dance pop is the most popular genre from 2010-2019, with nearly 300 (298) occurrences! Don’t forget the entire cleaned 2010-2019 dataset only has 536 rows!

Let’s visualize that in a graph.

# Load the color package so we can add color to the bar plot
library(RColorBrewer)
coul <- brewer.pal(5, "Set2")
# Horizontal bar plot with bars colored from the palette and labels set horizontally;
# with horiz = TRUE the x-axis shows the number of songs per genre
barplot(spotify_top_genres_sorted_abridged, xlab="Number of songs", col=coul, horiz=TRUE, las = 1)
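Note that these counts measure how often a genre appears in the dataset, not how high its songs score on the pop column. The two can disagree, so it may be worth looking at both. Here is a minimal sketch on hypothetical toy data (the genre labels are real genre names, the scores are invented):

```r
# Toy data; popularity scores are hypothetical
toy <- data.frame(
  genre = c("dance pop", "dance pop", "dance pop", "pop", "pop", "edm"),
  pop   = c(60, 70, 65, 80, 90, 75),
  stringsAsFactors = FALSE
)

# How often each genre appears (what table() measures)
genre_counts <- sort(table(toy$genre), decreasing = TRUE)

# Mean popularity score per genre (a different notion of "popular")
genre_mean_pop <- sort(tapply(toy$pop, toy$genre, mean), decreasing = TRUE)

print(genre_counts)
print(genre_mean_pop)
```

In this toy example the most frequent genre and the genre with the highest mean score are not the same, which is exactly the distinction to keep in mind when reading the bar plot.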

2) Does bpm correlate with danceability?

Next, let’s look at whether bpm correlates with danceability.

Once again, let’s visualize that. In order to do so, we will need the help of the package ggpubr.

library('ggpubr')

## Loading required package: ggplot2

ggscatter(spotify_slice_no_dup, x='bpm',y='dnce', add ='reg.line',conf.int = FALSE, cor.coef=TRUE, cor.method='pearson', xlab='BPM', ylab='Danceability')

## `geom_smooth()` using formula 'y ~ x'

Wait a second here! According to the data, there’s a song with bpm that’s nearly 2000 and popularity over 1500. Clearly there’s something wrong here! Let’s remove this outlier.

# Remove the outlier where bpm is nearly 2000
spotify_slice_no_dup_filtered <- spotify_slice_no_dup[spotify_slice_no_dup$bpm < 500, ]
print(spotify_slice_no_dup_filtered['bpm'])

## # A tibble: 535 × 1
##      bpm
##    <dbl>
##  1    97
##  2    87
##  3   120
##  4   119
##  5   109
##  6    65
##  7   120
##  8   148
##  9    93
## 10   126
## # … with 525 more rows
## # ℹ Use `print(n = ...)` to see more rows
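Before dropping rows, it can help to look at what a filter would remove. The sketch below uses a hypothetical toy data frame and an assumed plausible tempo range of 40 to 250 bpm; those bounds are a judgment call of mine, not something the dataset defines.

```r
# Toy data with one implausible tempo; all values are hypothetical
toy <- data.frame(
  title = c("Song A", "Song B", "Song C"),
  bpm   = c(120, 1985, 95),
  dnce  = c(70, 64, 58),
  stringsAsFactors = FALSE
)

# Assumed plausible tempo range; the exact bounds are a judgment call
bpm_ok <- toy$bpm >= 40 & toy$bpm <= 250

# Inspect the rows that would be dropped before dropping them
print(toy[!bpm_ok, ])

toy_filtered <- toy[bpm_ok, ]
print(toy_filtered)
```

Looking at the full flagged row, rather than just the bpm column, can also reveal whether other values in that row look shifted or corrupted.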

Let’s try plotting again.

library('ggpubr')
ggscatter(spotify_slice_no_dup_filtered, x='bpm',y='dnce', add ='reg.line',conf.int = TRUE, cor.coef=TRUE, cor.method='pearson', xlab='BPM', ylab='Danceability')

## `geom_smooth()` using formula 'y ~ x'

That looks more correct! It seems that whether a song is fast or slow has little to do with how danceable it is! That makes sense. There are slow dances as well as fast dances!
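The scatterplot prints a correlation coefficient, but to weigh H0 against HA more formally one could run a correlation test directly. The following is a sketch on simulated data, not the Spotify values: the two vectors are generated independently, so they only stand in for the bpm and dnce columns.

```r
set.seed(42)  # make the simulated data reproducible

# Simulated stand-ins for the bpm and dnce columns (not real Spotify values)
bpm  <- runif(200, min = 60, max = 200)
dnce <- rnorm(200, mean = 65, sd = 12)  # generated independently of bpm

# Pearson correlation test: H0 is that the true correlation is zero
result <- cor.test(bpm, dnce, method = "pearson")

print(result$estimate)  # sample correlation coefficient r
print(result$p.value)   # small values (e.g. below 0.05) would argue against H0
```

On the real data, the same call would take spotify_slice_no_dup_filtered$bpm and spotify_slice_no_dup_filtered$dnce as its two arguments.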

3) Does bpm correlate with popularity?

Once again let’s use a scatterplot and visualize this graph.

library('ggpubr')
ggscatter(spotify_slice_no_dup_filtered, x='bpm',y='pop', add ='reg.line',conf.int = TRUE, cor.coef=TRUE, cor.method='pearson', xlab='BPM', ylab='Popularity')

## `geom_smooth()` using formula 'y ~ x'

It also seems that song tempo doesn’t matter when it comes to popularity, though most songs cluster around 100 to 150 bpm.

4) Is there a particular duration of song that is the most popular?

This time we’ll plot ‘dur’ (song duration, measured in seconds) against ‘pop’.

library('ggpubr')
ggscatter(spotify_slice_no_dup_filtered, x='dur',y='pop', xlab='Song Duration (seconds)', add ='reg.line',conf.int = TRUE, cor.coef=TRUE, cor.method='pearson', ylab='Popularity')

## `geom_smooth()` using formula 'y ~ x'

It seems the most popular songs are clustered around 3-4 minutes. There’s also a very weak correlation between song length and popularity.
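A linear correlation cannot show whether one duration range stands out, which is what the question asks. One way to probe that is to bin the durations and compare the bins. The sketch below uses a hypothetical toy data frame and one-minute bins that I chose for illustration.

```r
# Toy data; durations (seconds) and popularity scores are hypothetical
toy <- data.frame(
  dur = c(150, 200, 215, 230, 250, 290, 330),
  pop = c(60, 75, 80, 78, 70, 65, 50)
)

# Cut durations into one-minute bins (bounds in seconds, upper bound inclusive)
toy$dur_bin <- cut(
  toy$dur,
  breaks = c(0, 180, 240, 300, Inf),
  labels = c("under 3 min", "3-4 min", "4-5 min", "over 5 min")
)

# Number of songs and mean popularity in each bin
songs_per_bin    <- table(toy$dur_bin)
mean_pop_per_bin <- tapply(toy$pop, toy$dur_bin, mean)

print(songs_per_bin)
print(mean_pop_per_bin)
```

Reading the counts and the means side by side separates "most songs are 3-4 minutes long" from "3-4 minute songs score highest", which are two different claims.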

Conclusions

Based on our examination of the dataset, it seems like dance pop dominated Spotify’s billboards from 2010 to 2019. Bear in mind, this doesn’t necessarily mean that dance pop is the best genre as it could be that people who listen to Spotify prefer dance pop. For example, a classical music station like KUSC would likely attract and thus have a higher proportion of classical music fans.

Secondly, it seems like tempo (bpm) has little influence on a song’s danceability. This fell within expectations as people like to slow dance (ballads) as well as dance to fast music.

Thirdly, people did not tend to prefer fast music over slow. It’s important to note here that because so many songs fell within the ‘dance pop’ genre, the data could be skewed towards that genre’s typical tempo. With that said, the bulk of songs in the dataset were around 100 to 130 bpm.

Lastly, song duration and popularity were very weakly negatively correlated (r=-0.11). This means that longer songs trend slightly towards lower popularity. However, most music fit within 3-4 minutes, likely because of the influence from dance pop.

In all, though things definitely are not set in stone, chances are that if you’re going to listen to a top song on Spotify, it’ll likely be a 3-4 minute dance pop song at around 100bpm (Allegretto)!