This dataset was accessed via DataCamp’s server. While DataCamp’s built-in Workspace supports R, I decided to take advantage of RStudio’s ability to create R Markdown files instead of using a Jupyter Notebook, just for the sake of it.
The main impetus for exploring this particular dataset comes from my background as a musician and as the CEO of a hi-fi audio company. Even though I am a classically trained violinist, one thing music shares across genres is its ability to move, connect and communicate without borders. Ysaye’s Sonatas and, dare I say, the Backstreet Boys are all capable of making the listener feel.
For these reasons, the tone of this study clearly leans towards the informal side; specifically, I’ll be documenting my thought processes and approach in a very colloquial manner.
Despite the levity of the subject matter and my tone, we will still treat the analysis itself seriously, and the process remains unchanged.
The main questions I seek to answer from the dataset are as follows:
Which genres are the most popular from the last decade?
H0: No genre is more popular than any other.
HA: At least one genre is more popular than the others.
Does bpm correlate with danceability?
H0: No, there is no correlation between bpm and danceability.
HA: There is a correlation between bpm and danceability.
Does bpm correlate with popularity?
H0: No, there is no correlation between bpm and popularity.
HA: Yes, there is a correlation between bpm and popularity.
Is there a particular duration of song that is the most popular?
H0: No, there is no duration of song that is the most popular.
HA: Yes, there is an ideal song duration for popularity.
First we’ll need to load the data. We’re going to use the readr library for its speed, then quickly examine the first 10 rows and the overall shape of the data frame.
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(RColorBrewer)
library(wesanderson)
spotify <- read_csv("spotify.csv")
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 549 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): title, artist, top genre, year
## dbl (9): bpm, nrgy, dnce, dB, live, val, dur, acous, spch
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(spotify, n=10)
## # A tibble: 10 × 14
## title artist top g…¹ year bpm nrgy dnce dB live val dur acous
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 "\"Hey,… Train neo me… 2010 97 89 67 -4 8 80 217 19
## 2 "Love T… Eminem detroi… 2010 87 93 75 -5 52 64 263 24
## 3 "TiK To… Kesha dance … 2010 120 84 76 -3 29 71 200 10
## 4 "Bad Ro… Lady … dance … 2010 119 92 70 -4 8 71 295 0
## 5 "Just t… Bruno… pop 2010 109 84 64 -5 9 43 221 2
## 6 "Baby" Justi… canadi… 2010 65 86 73 -5 11 54 214 4
## 7 "Dynami… Taio … dance … 2010 120 78 75 -4 4 82 203 0
## 8 "Secret… OneRe… dance … 2010 148 76 52 -6 12 38 225 7
## 9 "Empire… Alici… hip pop 2010 93 37 48 -8 12 14 216 74
## 10 "Only G… Rihan… barbad… 2010 126 72 79 -4 7 61 235 13
## # … with 2 more variables: spch <dbl>, pop <dbl>, and abbreviated variable name
## # ¹`top genre`
## # ℹ Use `colnames()` to see all variable names
cat(dim(spotify))
## 549 14
It seems there are only 549 rows across 14 columns, which is less data than I expected but still a decent amount to work with.
The data is in tidy format, with each row pertaining to one song, and each column pertaining to an aspect of that song (e.g. title, artist, year, etc.).
Let’s look at each column in detail.
colnames(spotify)
## [1] "title" "artist" "top genre" "year" "bpm" "nrgy"
## [7] "dnce" "dB" "live" "val" "dur" "acous"
## [13] "spch" "pop"
There are 14 columns. They’re broken down as follows:
In order to answer the questions we have, we’d like the following columns (n=10): [‘title’, ‘artist’, ‘top genre’, ‘year’, ‘bpm’, ‘dnce’, ‘dB’, ‘val’, ‘dur’, ‘pop’]
Let’s slice the data frame down to just the columns we need. It’s important to note here that R does not use zero-indexing, so the first column is 1 and not 0, and so on.
spotify_slice <- spotify[,c(1,2,3,4,5,7,8,10,11,14)]
print(spotify_slice)
## # A tibble: 549 × 10
## title artist top g…¹ year bpm dnce dB val dur pop
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 "\"Hey, Soul Sister… Train neo me… 2010 97 67 -4 80 217 83
## 2 "Love The Way You L… Eminem detroi… 2010 87 75 -5 64 263 82
## 3 "TiK ToK" Kesha dance … 2010 120 76 -3 71 200 80
## 4 "Bad Romance" Lady … dance … 2010 119 70 -4 71 295 79
## 5 "Just the Way You A… Bruno… pop 2010 109 64 -5 43 221 78
## 6 "Baby" Justi… canadi… 2010 65 73 -5 54 214 77
## 7 "Dynamite" Taio … dance … 2010 120 75 -4 82 203 77
## 8 "Secrets" OneRe… dance … 2010 148 52 -6 38 225 77
## 9 "Empire State of Mi… Alici… hip pop 2010 93 48 -8 14 216 76
## 10 "Only Girl (In The … Rihan… barbad… 2010 126 79 -4 61 235 73
## # … with 539 more rows, and abbreviated variable name ¹`top genre`
## # ℹ Use `print(n = ...)` to see more rows
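Selecting by position works, but it quietly picks the wrong columns if the column order ever changes. As a minimal sketch, here is the same idea using column names instead. The `toy` data frame and its values are made up for illustration and stand in for the real data.

```r
# Toy stand-in for the Spotify data; all values are made up for illustration.
toy <- data.frame(
  title = c("Song A", "Song B"),
  artist = c("Artist X", "Artist Y"),
  `top genre` = c("pop", "dance pop"),
  nrgy = c(80, 70),
  bpm = c(100, 120),
  pop = c(75, 60),
  check.names = FALSE  # keep the space in "top genre"
)

# Select by name rather than by position, so column order no longer matters.
wanted <- c("title", "artist", "top genre", "bpm", "pop")
toy_slice <- toy[, wanted]

print(colnames(toy_slice))
```

The same pattern should carry over to the real data by listing the ten column names we want.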
It seems like the data is mostly clean (at least based on the limited view we’ve seen so far). Let’s check whether there are duplicates in the title column and whether null values exist anywhere.
spotify[duplicated(spotify_slice[,1]),]
## # A tibble: 13 × 14
## title artist top g…¹ year bpm nrgy dnce dB live val dur acous
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Just th… Bruno… pop 2011 109 84 64 -5 9 43 221 2
## 2 Marry Y… Bruno… pop 2011 145 83 62 -5 10 48 230 33
## 3 Written… Tinie… dance … 2011 91 95 64 -4 18 57 220 6
## 4 Castle … T.I. atl hi… 2011 80 86 45 -5 26 58 329 7
## 5 We Are … Taylo… pop 2013 86 68 63 -6 12 75 193 1
## 6 A Littl… Fergie dance … 2014 130 62 76 -6 9 52 241 1
## 7 Company Justi… canadi… 2016 95 80 59 -5 8 43 208 13
## 8 Runnin'… Naugh… tropic… 2016 139 85 32 -6 48 8 213 1
## 9 Here Aless… canadi… 2016 120 82 38 -4 8 33 199 8
## 10 All I A… Adele britis… 2017 142 28 59 -5 15 34 272 88
## 11 I Like … Cardi… pop 2018 136 73 82 -4 37 65 253 10
## 12 First T… Kygo edm 2018 90 59 63 -7 10 68 194 20
## 13 Kissing… DNCE dance … 2018 120 74 77 -6 9 86 202 5
## # … with 2 more variables: spch <dbl>, pop <dbl>, and abbreviated variable name
## # ¹`top genre`
## # ℹ Use `colnames()` to see all variable names
Well well, looks like there are a few duplicate titles (13). Let’s remove these by creating a new data frame without them.
spotify_slice_no_dup <- spotify_slice[!duplicated(spotify_slice[,1]),]
print(spotify_slice_no_dup)
## # A tibble: 536 × 10
## title artist top g…¹ year bpm dnce dB val dur pop
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 "\"Hey, Soul Sister… Train neo me… 2010 97 67 -4 80 217 83
## 2 "Love The Way You L… Eminem detroi… 2010 87 75 -5 64 263 82
## 3 "TiK ToK" Kesha dance … 2010 120 76 -3 71 200 80
## 4 "Bad Romance" Lady … dance … 2010 119 70 -4 71 295 79
## 5 "Just the Way You A… Bruno… pop 2010 109 64 -5 43 221 78
## 6 "Baby" Justi… canadi… 2010 65 73 -5 54 214 77
## 7 "Dynamite" Taio … dance … 2010 120 75 -4 82 203 77
## 8 "Secrets" OneRe… dance … 2010 148 52 -6 38 225 77
## 9 "Empire State of Mi… Alici… hip pop 2010 93 48 -8 14 216 76
## 10 "Only Girl (In The … Rihan… barbad… 2010 126 79 -4 61 235 73
## # … with 526 more rows, and abbreviated variable name ¹`top genre`
## # ℹ Use `print(n = ...)` to see more rows
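One caveat: deduplicating on title alone would also drop two different songs that happen to share a name. A hedged sketch of a stricter check, which treats a row as a duplicate only when both title and artist repeat, is below. The `toy` data frame and its artist names are made up for illustration.

```r
# Toy data: "Here" appears three times, but by two different (made-up) artists.
toy <- data.frame(
  title  = c("Here", "Here", "Baby", "Here"),
  artist = c("Artist X", "Artist Y", "Artist Z", "Artist X"),
  year   = c(2015, 2016, 2010, 2016)
)

# duplicated() on a multi-column data frame compares whole rows,
# so a row is flagged only if the same title AND artist were seen earlier.
is_dup <- duplicated(toy[, c("title", "artist")])
toy_no_dup <- toy[!is_dup, ]

print(toy_no_dup)
```

Here only the fourth row is dropped, because it is the only one that repeats an earlier title and artist pair.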
We now have 536 rows. Let’s check to see if there are any missing (NA) values.
summary(is.na(spotify_slice_no_dup))
## title artist top genre year
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:536 FALSE:536 FALSE:536 FALSE:536
## bpm dnce dB val
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:536 FALSE:536 FALSE:536 FALSE:536
## dur pop
## Mode :logical Mode :logical
## FALSE:536 FALSE:536
Running summary on is.na() shows that no column has any missing values. That makes life a little easier.
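For a more compact view, the missing values can also be counted per column. This is a small sketch on a made-up data frame (`toy`), not the real data:

```r
# Toy data with one deliberately missing bpm value.
toy <- data.frame(
  bpm = c(97, NA, 120),
  pop = c(83, 82, 80)
)

# is.na() gives a logical matrix; colSums() counts the TRUEs in each column.
na_counts <- colSums(is.na(toy))

print(na_counts)
```

A column with a count of zero has no missing values.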
Let’s now try to answer the first question:
1) Which genres are the most popular from the last decade?
In order to do so, we will count how many songs fall under each genre and then sort those counts.
spotify_top_genres <- table(spotify_slice_no_dup['top genre'])
spotify_top_genres_sorted <- sort(spotify_top_genres, decreasing = TRUE)
spotify_top_genres_sorted_abridged <- tail(sort(spotify_top_genres_sorted, decreasing = FALSE),10)
Wow, it looks like dance pop is the most popular genre from 2010-2019, with nearly 300 (298) occurrences! Don’t forget the entire cleaned 2010-2019 dataset only has 536 rows!
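To put that count in perspective, it helps to turn counts into shares. The sketch below does this on a small made-up vector of genres (`genres`), then works out the share implied by the figures quoted above (298 of 536 songs).

```r
# Made-up genre labels, standing in for the `top genre` column.
genres <- c("dance pop", "pop", "dance pop", "edm", "dance pop", "pop")

# Count songs per genre and sort from most to least frequent.
counts <- sort(table(genres), decreasing = TRUE)

# Convert counts to percentages of all songs.
shares <- round(100 * prop.table(counts), 1)
print(shares)

# Share implied by the numbers in the text: 298 dance pop songs out of 536.
dance_pop_share <- round(100 * 298 / 536, 1)
print(dance_pop_share)
```

By this arithmetic, dance pop accounts for a little over half of the cleaned dataset.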
Let’s visualize that in a graph.
#This loads the color package so we can add color to the barplot
library(RColorBrewer)
library(wesanderson)
coul <- brewer.pal(5, "Set2")
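#This creates a horizontal bar plot with bar colors taken from the palette and axis labels set horizontally
#Because the bars are horizontal, the x-axis shows the song count and the genres sit on the y-axis
barplot(spotify_top_genres_sorted_abridged, xlab="Number of songs", col=coul, horiz=TRUE, las = 1)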
Next, let’s look at whether bpm correlates with danceability.
Once again, let’s visualize that. In order to do so, we will need the help of the package ggpubr.
library('ggpubr')
## Loading required package: ggplot2
ggscatter(spotify_slice_no_dup, x='bpm',y='dnce', add ='reg.line',conf.int = FALSE, cor.coef=TRUE, cor.method='pearson', xlab='BPM', ylab='Danceability')
## `geom_smooth()` using formula 'y ~ x'
Wait a second here! According to the data, there’s a song with a bpm of nearly 2000 and a danceability over 1500. Clearly there’s something wrong here! Let’s remove this outlier.
#removing the outlier where bpm is nearly 2000
spotify_slice_no_dup_filtered <- spotify_slice_no_dup[spotify_slice_no_dup$bpm < 500, ]
print(spotify_slice_no_dup_filtered['bpm'])
## # A tibble: 535 × 1
## bpm
## <dbl>
## 1 97
## 2 87
## 3 120
## 4 119
## 5 109
## 6 65
## 7 120
## 8 148
## 9 93
## 10 126
## # … with 525 more rows
## # ℹ Use `print(n = ...)` to see more rows
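Before dropping a row like this, it can be worth looking at it, since the parsing warning when the file was loaded (the one pointing to `problems()`) suggests a malformed line rather than a genuinely strange song. A minimal sketch on made-up data follows. The `toy` values and the cutoff of 500 bpm are assumptions for illustration.

```r
# Toy data: the second row mimics a mis-parsed line with implausible values.
toy <- data.frame(
  title = c("Song A", "Song B", "Song C"),
  bpm   = c(97, 1985, 120),
  dnce  = c(67, 1600, 76)
)

# Rows with a tempo above the (assumed) plausibility cutoff.
suspect <- toy[toy$bpm >= 500, ]
print(suspect)

# Everything else is kept for the analysis.
clean <- toy[toy$bpm < 500, ]
print(nrow(clean))
```

Printing the suspect rows first makes it easier to judge whether the cutoff removes only the broken record.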
Let’s try plotting again.
library('ggpubr')
ggscatter(spotify_slice_no_dup_filtered, x='bpm',y='dnce', add ='reg.line',conf.int = TRUE, cor.coef=TRUE, cor.method='pearson', xlab='BPM', ylab='Danceability')
## `geom_smooth()` using formula 'y ~ x'
That looks more correct! It seems that whether a song is fast or slow has little to do with how danceable it is! That makes sense. There are slow dances as well as fast dances!
Next, does bpm correlate with popularity? Once again, let’s use a scatterplot to visualize this.
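The scatterplot reports r, but the H0/HA framing from the start calls for an explicit test. One way to do that, sketched here under the assumption that a Pearson test is appropriate, is base R’s `cor.test()`. The `bpm` and `dnce` vectors below are synthetic and generated independently of each other; they are not the Spotify data.

```r
set.seed(42)  # make the synthetic data reproducible

# Synthetic stand-ins: tempo and danceability drawn independently.
bpm  <- runif(200, min = 60, max = 200)
dnce <- rnorm(200, mean = 65, sd = 12)

# Pearson correlation test; H0 is that the true correlation is zero.
result <- cor.test(bpm, dnce, method = "pearson")

print(result$estimate)  # sample correlation coefficient r
print(result$p.value)   # p-value for H0: correlation = 0
```

A small p-value would be evidence against H0; otherwise we fail to reject it. On the real data, the two columns of `spotify_slice_no_dup_filtered` would take the place of the synthetic vectors.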
library('ggpubr')
ggscatter(spotify_slice_no_dup_filtered, x='bpm',y='pop', add ='reg.line',conf.int = TRUE, cor.coef=TRUE, cor.method='pearson', xlab='BPM', ylab='Popularity')
## `geom_smooth()` using formula 'y ~ x'
Song tempo also doesn’t seem to matter when it comes to popularity, though most songs cluster around 100 to 150 bpm.
Finally, is there a particular song duration that is the most popular? This time we’ll plot ‘dur’ (song duration, measured in seconds) against ‘pop’.
library('ggpubr')
ggscatter(spotify_slice_no_dup_filtered, x='dur',y='pop', xlab='Song Duration (seconds)', add ='reg.line',conf.int = TRUE, cor.coef=TRUE, cor.method='pearson', ylab='Popularity')
## `geom_smooth()` using formula 'y ~ x'
It seems the most popular songs are clustered around 3-4 minutes. There’s also a very weak correlation between song length and popularity.
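A correlation coefficient only captures a straight-line trend, so it cannot show whether popularity peaks at some middle duration. One hedged way to probe that is to group songs into one-minute buckets and compare typical popularity per bucket. The `toy` values below are made up for illustration.

```r
# Made-up durations (seconds) and popularity scores.
toy <- data.frame(
  dur = c(150, 190, 200, 215, 230, 250, 300, 330),
  pop = c(60, 75, 80, 78, 74, 70, 65, 55)
)

# cut() assigns each duration to a one-minute bucket:
# [120,180), [180,240), [240,300), [300,360)
toy$bucket <- cut(toy$dur, breaks = seq(120, 360, by = 60), right = FALSE)

# Median popularity within each bucket.
median_pop <- tapply(toy$pop, toy$bucket, median)

print(median_pop)
```

If one bucket stands out on the real data, that would point towards an ideal duration more directly than the correlation does.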
Based on our examination of the dataset, it seems like dance pop dominated Spotify’s charts from 2010 to 2019. Bear in mind, this doesn’t necessarily mean that dance pop is the best genre; it could simply be that people who listen to Spotify prefer dance pop. For example, a classical music station like KUSC would likely attract, and thus have a higher proportion of, classical music fans.
Secondly, it seems like tempo (bpm) has little influence on a song’s danceability. This fell within expectations, as people like to slow dance (ballads) as well as dance to fast music.
Thirdly, people did not tend to prefer fast music over slow. It’s important to note here that because so many songs fell within the ‘dance pop’ genre, the data could be skewed towards that genre’s typical tempo. With that said, the bulk of songs in the dataset were around 100 to 130 bpm.
Lastly, song duration and popularity were very weakly negatively correlated (r=-0.11). This means that longer songs tend to be slightly less popular. However, most songs fit within 3-4 minutes, likely because of the influence of dance pop.
In all, though things definitely are not set in stone, chances are that if you’re going to listen to a top song on Spotify, it’ll likely be a 3-4 minute dance pop song at around 100 bpm (Allegretto)!