Data Wrangling Project

Spotify Genre Data

1. Introduction

Spotify is the most popular audio and video streaming service and the highest contributor to the music business today. Its global audio streaming service has 271 million users, including 124 million subscribers, across 79 markets.

Problem Statement

The objective of this project is to explore the Spotify Genre Data and discover how the various audio features correlate with the genres that exist in Spotify. In addition, we aim to understand the factors that affect track popularity and to measure the versatility of artists. This would help artists better understand what drives track popularity and make music accordingly. It would also help amateur melophiles identify their favorite genres by evaluating their interest in audio features such as danceability, energy and acousticness.

Approach

We would first perform extensive data cleaning, which includes identifying missing values and duplicates and handling them accordingly. We would separate the release date column to obtain the year value for trend analysis. We would also use derived attributes to aid our analysis. The audio features would be explored individually (univariate analysis) and jointly (multivariate analysis) to see if they have an effect on track popularity and genre classification. The audio features can be broadly classified into confidence measures, perceptual measures and descriptors. One of the key takeaways of this study would be a measure of the versatility of an artist, based on the variety of genres their tracks belong to.
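One way the versatility idea could be turned into a derived attribute is sketched below. It assumes versatility is measured as the number of distinct playlist genres an artist appears in; that definition, the toy data and the name versatility are illustrative assumptions, not something the report fixes.

```r
# Toy data, made up for illustration only; column names follow the dataset.
tracks <- data.frame(
  track_artist   = c("Artist A", "Artist A", "Artist A", "Artist B", "Artist B"),
  playlist_genre = c("pop", "rock", "pop", "rap", "rap"),
  stringsAsFactors = FALSE
)

# Hypothetical versatility score: number of distinct genres per artist.
versatility <- vapply(
  split(tracks$playlist_genre, tracks$track_artist),
  function(genres) length(unique(genres)),
  integer(1)
)

# Most versatile artists first.
versatility <- sort(versatility, decreasing = TRUE)
print(versatility)
```

On the real data the same call would use spotify_data in place of the toy data frame.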

2. Packages Required

The packages that we have used are as follows:

Tidyverse: Imported mainly for data cleaning. Loading it also attaches other libraries such as dplyr (for the %>% operator) and tidyr (for separate()).
Rapportools: Consists of helper functions that facilitate the creation of reproducible statistical report templates. We imported it to use is.empty() to identify empty values.
Lubridate: Eases working with date and time data types.
Knitr: Enables the integration of R code into R Markdown. In our case we used it to display the variables in a neat tabular format.
DT: Renders R data objects as interactive HTML tables.

library(tidyverse)
library(rapportools)
library(lubridate)
library(knitr)
library(DT)

3. Data Preparation

3.1 Data Source

The data source is the Spotify Songs dataset, collected from the Spotify application via the spotifyr package.

3.2 Original Dataset

We load the dataset into RStudio.

spotify_data <- read.csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv",as.is = TRUE)

The original dataset has 32833 rows and 23 columns. It was collected from Every Noise, an interesting visualization of the Spotify genre space maintained by a genre taxonomist. The dataset includes 5000 songs for each genre, split across various subgenres. The main purpose of the original dataset was to explore the following audio features:

Confidence Measures
- Acousticness, Liveness, Speechiness and Instrumentalness
Perceptual Measures
- Energy, Loudness, Danceability and Valence
Descriptors
- Duration, Tempo, Key and Mode

The dataset consists of the following variables:

names(spotify_data)

##  [1] "track_id"                 "track_name"              
##  [3] "track_artist"             "track_popularity"        
##  [5] "track_album_id"           "track_album_name"        
##  [7] "track_album_release_date" "playlist_name"           
##  [9] "playlist_id"              "playlist_genre"          
## [11] "playlist_subgenre"        "danceability"            
## [13] "energy"                   "key"                     
## [15] "loudness"                 "mode"                    
## [17] "speechiness"              "acousticness"            
## [19] "instrumentalness"         "liveness"                
## [21] "valence"                  "tempo"                   
## [23] "duration_ms"

3.3 Data Cleaning

Step1: Handling Missing and Empty Values

colSums(is.na(spotify_data))

##                 track_id               track_name             track_artist 
##                        0                        5                        5 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        5 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0

spotify_data <- na.omit(spotify_data)

The track_name, track_album_name and track_artist variables each contain 5 missing values. We decided to remove these rows since they would hamper our analysis. A total of 5 rows were omitted, which would not have a severe impact on the insights derived from the dataset.
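Per-column NA counts alone do not say how many rows are affected, since three columns with 5 NAs each could span anywhere from 5 to 15 rows. A minimal sketch of the check, on made-up toy data, uses complete.cases() to count affected rows before dropping them:

```r
# Toy data, made up for illustration only.
toy <- data.frame(
  track_name       = c("Song 1", NA, "Song 3"),
  track_artist     = c("Artist A", NA, "Artist C"),
  track_popularity = c(66L, 0L, 70L),
  stringsAsFactors = FALSE
)

# NA count per column, as in the report.
na_counts <- colSums(is.na(toy))

# Number of rows that contain at least one NA.
rows_with_na <- sum(!complete.cases(toy))

# Drop those rows and confirm how many were removed.
cleaned <- na.omit(toy)
rows_removed <- nrow(toy) - nrow(cleaned)
print(c(rows_with_na = rows_with_na, rows_removed = rows_removed))
```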

colSums(is.empty(spotify_data))

##                 track_id               track_name             track_artist 
##                        0                        0                        0 
##         track_popularity           track_album_id         track_album_name 
##                     2698                        0                        0 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        1 
##                   energy                      key                 loudness 
##                        0                     3454                        0 
##                     mode              speechiness             acousticness 
##                    14256                        1                        1 
##         instrumentalness                 liveness                  valence 
##                    12085                        1                        1 
##                    tempo              duration_ms 
##                        1                        0

spotify_data <- spotify_data[!is.empty(spotify_data$track_name),]

The variables track_popularity, key, mode, speechiness, acousticness, instrumentalness, liveness, tempo, danceability and valence are numeric, and is.empty() counts a value of 0 as empty (for example, a mode of 0 indicates a minor scale). Since these zeros are recorded values rather than missing data, we decided not to remove or impute them. The output above also shows that track_name has no empty values left once the 5 NA rows were dropped, so the filter on track_name acts only as a safeguard.
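To keep zeros in numeric columns from being reported as empty, the check could be restricted to character columns. The sketch below is a base R alternative under that assumption; count_blank is a hypothetical helper, not a rapportools function, and the data is made up.

```r
# Hypothetical helper: counts NA, empty, or whitespace-only strings.
# Non-character columns are reported as having no blanks.
count_blank <- function(col) {
  if (!is.character(col)) {
    return(0L)
  }
  sum(is.na(col) | !nzchar(trimws(col)))
}

# Toy data, made up for illustration only.
toy <- data.frame(
  track_name       = c("Song 1", "", "  "),
  mode             = c(0L, 1L, 0L),
  instrumentalness = c(0, 0.5, 0),
  stringsAsFactors = FALSE
)

blank_counts <- vapply(toy, count_blank, integer(1))
print(blank_counts)
```

With this check, zeros in mode and instrumentalness are no longer flagged, while blank track names still are.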

Step2: Changing the datatypes of certain variables

str(spotify_data)

## 'data.frame':    32828 obs. of  23 variables:
##  $ track_id                : chr  "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
##  $ track_name              : chr  "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr  "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : int  66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : chr  "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
##  $ track_album_name        : chr  "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr  "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
##  $ playlist_name           : chr  "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_id             : chr  "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
##  $ playlist_genre          : chr  "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr  "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : int  6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : int  1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num  122 100 124 122 124 ...
##  $ duration_ms             : int  194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
##  - attr(*, "na.action")= 'omit' Named int  8152 9283 9284 19569 19812
##   ..- attr(*, "names")= chr  "8152" "9283" "9284" "19569" ...

spotify_data$track_album_release_date <- as.Date(spotify_data$track_album_release_date)

We observed that the track_album_release_date column was read as a character data type and had to be converted to a Date type.
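as.Date() with its default formats expects complete dates. If some release dates were recorded with only a year or a year and month (an assumption, since this report does not check for it), those entries would not parse cleanly. A hedged sketch of one way to handle that is to pad partial dates to the first day of the period before converting; parse_release_date is a hypothetical helper.

```r
# Hypothetical helper, not part of the original report.
# Pads "YYYY" and "YYYY-MM" strings so every value is a full date.
parse_release_date <- function(x) {
  padded <- ifelse(
    nchar(x) == 4, paste0(x, "-01-01"),          # "1978"    -> "1978-01-01"
    ifelse(nchar(x) == 7, paste0(x, "-01"), x)   # "2012-05" -> "2012-05-01"
  )
  as.Date(padded, format = "%Y-%m-%d")
}

# Made-up example values.
raw_dates <- c("2019-06-14", "1978", "2012-05")
parsed <- parse_release_date(raw_dates)
print(parsed)
```

Padding changes the day and month of partial dates, so only the year of those entries should be relied on afterwards.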

Step3: Splitting the track_album_release_date into day, month and year

tibble(spotify_data$track_album_release_date)

## # A tibble: 32,828 x 1
##    `spotify_data$track_album_release_date`
##    <date>                                 
##  1 2019-06-14                             
##  2 2019-12-13                             
##  3 2019-07-05                             
##  4 2019-07-19                             
##  5 2019-03-05                             
##  6 2019-07-11                             
##  7 2019-07-26                             
##  8 2019-08-29                             
##  9 2019-06-14                             
## 10 2019-06-20                             
## # ... with 32,818 more rows

spotify_data <- spotify_data %>%
  dplyr::mutate(
    year = lubridate::year(track_album_release_date),
    month = lubridate::month(track_album_release_date),
    day = lubridate::day(track_album_release_date)
  )

We aim to analyze how the data trends across release years by artist name and genre. We therefore split track_album_release_date into year, month and day.
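As an illustration of the trend analysis the year column enables, the sketch below averages popularity per release year. It uses base R (format() and aggregate()) in place of lubridate and dplyr so that it runs without extra packages; the data is made up.

```r
# Toy data, made up for illustration only.
toy <- data.frame(
  track_album_release_date = as.Date(
    c("2018-03-01", "2018-11-20", "2019-06-14", "2019-12-13")
  ),
  track_popularity = c(40L, 60L, 66L, 70L)
)

# Base R equivalent of lubridate::year().
toy$year <- as.integer(format(toy$track_album_release_date, "%Y"))

# Mean popularity per release year.
yearly <- aggregate(track_popularity ~ year, data = toy, FUN = mean)
print(yearly)
```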

Step4: Selecting the required columns from the dataset

spotify_data <- select(spotify_data,track_artist,year,track_album_name,track_name,track_popularity,everything(),-c(track_id,track_album_id,playlist_id,track_album_release_date,month,day))

We selected all the variables except track_id, track_album_id, playlist_id, track_album_release_date, month and day, and rearranged the selected columns in a more comprehensible order.

3.4 Data Preview

The data preview below is the top 100 rows of the cleaned data.

output_data <- head(spotify_data, n = 100)
datatable(output_data, filter = 'top', options = list(pageLength = 25))

Below is the table containing each variable's name, data type and description for the Spotify songs dataset.

    variable_name <- colnames(spotify_data)
    variable_descp <- c("Song Artist", "Year when album released",
                        "Song album name",
                        "Song name",
                        "Song Popularity (0-100) where higher is better",
                        "Name of playlist",
                        "Playlist genre",
                        "Playlist subgenre",
                        "Danceability describes how suitable a track is for dancing based on a combination of musical elements. A value of 0.0 is least danceable and 1.0 is most danceable",
                        "Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity.",
                        "The estimated overall key of the track",
                        "The overall loudness of a track in decibels (dB)",
                        "Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived",
                        "Speechiness detects the presence of spoken words in a track.",
                        "A confidence measure from 0.0 to 1.0 of whether the track is acoustic",
                        "Predicts whether a track contains no vocals. ",
                        "Detects the presence of an audience in the recording",
                        "A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track",
                        "The overall estimated tempo of a track in beats per minute (BPM)",
                        "Duration of song in milliseconds"
                        )
    # Take the first class of each column so every entry is a single string
    col_types <- vapply(spotify_data, function(col) class(col)[1], character(1))
    variable_summary <- data.frame(variable_name, col_types, variable_descp, row.names = NULL)
    colnames(variable_summary) <- c("Variable Name", "Data Type", "Description")
    kable(variable_summary)

Variable Name	Data Type	Description
track_artist	character	Song Artist
year	numeric	Year when album released
track_album_name	character	Song album name
track_name	character	Song name
track_popularity	integer	Song Popularity (0-100) where higher is better
playlist_name	character	Name of playlist
playlist_genre	character	Playlist genre
playlist_subgenre	character	Playlist subgenre
danceability	numeric	Danceability describes how suitable a track is for dancing based on a combination of musical elements. A value of 0.0 is least danceable and 1.0 is most danceable
energy	numeric	Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity.
key	integer	The estimated overall key of the track
loudness	numeric	The overall loudness of a track in decibels (dB)
mode	integer	Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived
speechiness	numeric	Speechiness detects the presence of spoken words in a track.
acousticness	numeric	A confidence measure from 0.0 to 1.0 of whether the track is acoustic
instrumentalness	numeric	Predicts whether a track contains no vocals.
liveness	numeric	Detects the presence of an audience in the recording
valence	numeric	A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track
tempo	numeric	The overall estimated tempo of a track in beats per minute (BPM)
duration_ms	integer	Duration of song in milliseconds

4. Proposed Exploratory Data Analysis

We aim to study the distributions of the audio features and their correlation with track popularity and genre classification. The 12 audio features are used to classify the songs into genres; to do so, we would use boxplots and histograms to study the distribution of each variable. We would also use density plots to observe and compare the 12 audio features across genres.
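A first pass at the correlation part of this analysis could look like the sketch below, which computes the Pearson correlation of each audio feature with track popularity. The values are made up and only two features are shown; on the real data the feature list would cover all 12 audio features.

```r
# Toy data, made up for illustration only.
toy <- data.frame(
  track_popularity = c(20, 35, 50, 65, 80),
  danceability     = c(0.30, 0.45, 0.50, 0.70, 0.85),
  acousticness     = c(0.90, 0.60, 0.55, 0.30, 0.10)
)

features <- c("danceability", "acousticness")

# Pearson correlation of each feature with popularity.
popularity_cor <- vapply(
  toy[features],
  function(feature) cor(feature, toy$track_popularity),
  numeric(1)
)
print(round(popularity_cor, 3))
```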

Data visualization would utilize the ggplot2 package. We would also build a model to predict the popularity of a track from the audio features that appear significant in our exploratory data analysis.
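The report does not name a model type, so the sketch below assumes a plain linear regression with lm() as one possible starting point. The data and the choice of danceability and energy as predictors are made up for illustration.

```r
# Toy data, made up for illustration only.
toy <- data.frame(
  track_popularity = c(20, 35, 50, 65, 80, 55),
  danceability     = c(0.30, 0.45, 0.50, 0.70, 0.85, 0.60),
  energy           = c(0.80, 0.40, 0.90, 0.50, 0.70, 0.60)
)

# Assumed model form: linear regression of popularity on two audio features.
fit <- lm(track_popularity ~ danceability + energy, data = toy)

# Predict popularity for a hypothetical new track.
new_track <- data.frame(danceability = 0.65, energy = 0.75)
predicted <- predict(fit, newdata = new_track)
print(coef(fit))
print(predicted)
```

In practice the predictors would be the features that the exploratory analysis flags as significant, and the fit would be evaluated on held-out tracks.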