Spotify is the most popular audio and video streaming service and the highest contributor to the music business today. Its global audio streaming service has 271 million users, including 124 million subscribers, across 79 markets.
Problem Statement
The objective of the project is to explore the Spotify genre data and discover how the various audio features correlate with the genres that exist on Spotify. We also aim to understand the factors that affect track popularity and to discover the versatility of artists. This would help artists better understand what drives track popularity and produce their audio and video content accordingly. It would also help amateur melophiles identify their favorite genres by evaluating their interest in audio features such as danceability, energy and acousticness.
Approach
We would first perform extensive data cleaning, which includes identifying missing values and duplicates and handling them accordingly. We would separate the release date column to obtain the year value for trend analysis, and we would use derived attributes to aid our analysis. The audio features would be explored through univariate and multivariate analysis to see whether they affect track popularity and genre classification. The audio features can be broadly classified into confidence measures, perceptual measures and descriptors. One of the key takeaways of this study would be a measure of the versatility of an artist, based on the variety of genres their tracks belong to.
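The versatility idea above can be made concrete with a small sketch. It assumes one possible definition, the number of distinct playlist genres in which an artist's tracks appear. The artist names and rows are invented for illustration and are not taken from the dataset.

```r
# Toy data: artist names and genres are made up for illustration.
tracks <- data.frame(
  track_artist   = c("Artist A", "Artist A", "Artist A", "Artist B", "Artist B"),
  playlist_genre = c("pop", "rock", "pop", "rap", "rap"),
  stringsAsFactors = FALSE
)

# Assumed definition of versatility: number of distinct genres per artist.
genres_by_artist <- split(tracks$playlist_genre, tracks$track_artist)
versatility <- vapply(
  genres_by_artist,
  function(g) length(unique(g)),
  integer(1)
)

# Most versatile artists first.
versatility <- sort(versatility, decreasing = TRUE)
print(versatility)
```

Other definitions, such as counting sub-genres instead of genres, would only change the column passed to split().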
The packages that we have used are as follows:
Tidyverse: It was imported mainly for data cleaning; it also loads other libraries such as dplyr (which provides the %>% operator) and tidyr (for separate()).
Rapportools: It consists of helper functions that facilitate the creation of reproducible statistical report templates. We imported it to use is.empty() to identify empty values.
Lubridate: It is a package that eases working with date and time data types.
Knitr: It enables the integration of R code into R Markdown; in our case we used it to display the variables in a neat scrollable tabular format.
DT: It renders R data objects as interactive HTML tables.
library(tidyverse)
library(rapportools)
library(lubridate)
library(knitr)
library(DT)
The data source is the Spotify Songs dataset, obtained from the Spotify application via the spotifyr package.
We load the dataset into RStudio.
spotify_data <- read.csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv",as.is = TRUE)
The original dataset has 32833 rows and 23 columns. It was collected from Every Noise, an interesting visualization of the Spotify genre space maintained by a genre taxonomist. The dataset includes about 5000 songs for each genre, split across various sub-genres. The main purpose of the original dataset was to explore the audio features of these songs.
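The approach calls for identifying duplicates. One possible check is sketched below on made-up rows. It assumes that a track counts as a duplicate when its track_id appears more than once (for example, because the same track sits in more than one playlist); whether to keep or drop such rows depends on the analysis.

```r
# Toy data: IDs and genres are invented for illustration.
songs <- data.frame(
  track_id       = c("t1", "t2", "t1", "t3", "t2"),
  playlist_genre = c("pop", "pop", "rock", "rap", "latin"),
  stringsAsFactors = FALSE
)

# Flag every row whose track_id has already been seen earlier in the data.
is_repeat <- duplicated(songs$track_id)
n_repeats <- sum(is_repeat)

# Keep only the first occurrence of each track.
unique_songs <- songs[!is_repeat, ]

print(n_repeats)
print(unique_songs)
```

Note that dropping repeats in this way discards the extra genre labels, which would matter for any genre-level analysis.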
The dataset consists of the following variables:
names(spotify_data)
## [1] "track_id" "track_name"
## [3] "track_artist" "track_popularity"
## [5] "track_album_id" "track_album_name"
## [7] "track_album_release_date" "playlist_name"
## [9] "playlist_id" "playlist_genre"
## [11] "playlist_subgenre" "danceability"
## [13] "energy" "key"
## [15] "loudness" "mode"
## [17] "speechiness" "acousticness"
## [19] "instrumentalness" "liveness"
## [21] "valence" "tempo"
## [23] "duration_ms"
Step 1: Handling Missing and Empty Values
colSums(is.na(spotify_data))
##                 track_id               track_name             track_artist
##                        0                        5                        5
##         track_popularity           track_album_id         track_album_name
##                        0                        0                        5
## track_album_release_date            playlist_name              playlist_id
##                        0                        0                        0
##           playlist_genre        playlist_subgenre             danceability
##                        0                        0                        0
##                   energy                      key                 loudness
##                        0                        0                        0
##                     mode              speechiness             acousticness
##                        0                        0                        0
##         instrumentalness                 liveness                  valence
##                        0                        0                        0
##                    tempo              duration_ms
##                        0                        0
spotify_data <- na.omit(spotify_data)
The track_name, track_artist and track_album_name variables each contain 5 missing values. We decided to remove these rows, since they would hamper our analysis. A total of 5 rows were omitted, which should not have a severe impact on the insights derived from the dataset.
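As a small illustration of what na.omit() does, the sketch below uses an invented three-row data frame. It shows that a row is dropped when any of its columns is NA, and that the positions of the dropped rows are kept in the "na.action" attribute, which also appears in the str() output of the cleaned data.

```r
# Toy data: names and values are made up for illustration.
df <- data.frame(
  track_name       = c("Song 1", NA, "Song 3"),
  track_popularity = c(10L, 20L, 30L),
  stringsAsFactors = FALSE
)

# Count missing values per column.
print(colSums(is.na(df)))

# Drop every row that has at least one NA.
cleaned <- na.omit(df)
n_dropped <- nrow(df) - nrow(cleaned)

# na.omit() records which rows it removed.
dropped_rows <- as.integer(attr(cleaned, "na.action"))
print(n_dropped)
print(dropped_rows)
```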
colSums(is.empty(spotify_data))
##                 track_id               track_name             track_artist
##                        0                        0                        0
##         track_popularity           track_album_id         track_album_name
##                     2698                        0                        0
## track_album_release_date            playlist_name              playlist_id
##                        0                        0                        0
##           playlist_genre        playlist_subgenre             danceability
##                        0                        0                        1
##                   energy                      key                 loudness
##                        0                     3454                        0
##                     mode              speechiness             acousticness
##                    14256                        1                        1
##         instrumentalness                 liveness                  valence
##                    12085                        1                        1
##                    tempo              duration_ms
##                        1                        0
spotify_data <- spotify_data[!is.empty(spotify_data$track_name),]
is.empty() treats 0 as empty, so the non-zero counts for track_popularity, key, mode, speechiness, acousticness, instrumentalness, liveness, tempo, danceability and valence come from tracks whose value is 0. Zero is a legitimate value for these variables, so we decided not to remove or impute those values. The filter on track_name is a safeguard against empty track names; after na.omit() the output above shows that none remain, so the dataset still has 32828 rows.
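To separate genuine zeros from truly blank entries, one option is a stricter check that flags only NA values and empty or whitespace-only strings. The helper is_blank() below is a hypothetical name introduced for this sketch; it is not part of rapportools or of the original analysis, and the sample values are invented.

```r
# Hypothetical helper: TRUE only for NA or empty/whitespace-only strings.
# Unlike a check that counts 0 as empty, numeric zeros are left alone.
is_blank <- function(x) {
  if (is.character(x)) {
    is.na(x) | !nzchar(trimws(x))
  } else {
    is.na(x)
  }
}

# Toy values, made up for illustration.
instrumentalness <- c(0, 0.0042, 0, 0.9)
track_name       <- c("Song 1", "", "  ", NA)

n_zero        <- sum(instrumentalness == 0)      # valid zeros
n_blank_num   <- sum(is_blank(instrumentalness)) # nothing flagged
n_blank_names <- sum(is_blank(track_name))       # "", "  " and NA

print(c(zeros = n_zero, blank_numeric = n_blank_num, blank_names = n_blank_names))
```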
Step 2: Changing the data types of certain variables
str(spotify_data)
## 'data.frame': 32828 obs. of 23 variables:
## $ track_id : chr "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
## $ track_name : chr "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : chr "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_popularity : int 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : chr "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
## $ track_album_name : chr "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
## $ track_album_release_date: chr "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
## $ playlist_name : chr "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
## $ playlist_id : chr "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
## $ playlist_genre : chr "pop" "pop" "pop" "pop" ...
## $ playlist_subgenre : chr "dance pop" "dance pop" "dance pop" "dance pop" ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
## - attr(*, "na.action")= 'omit' Named int 8152 9283 9284 19569 19812
## ..- attr(*, "names")= chr "8152" "9283" "9284" "19569" ...
spotify_data$track_album_release_date <- as.Date(spotify_data$track_album_release_date)
We observed that the track_album_release_date column was read as a character data type, so we converted it to the Date data type.
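A conversion like this is worth checking, because strings that do not match the expected format become NA instead of raising an error. The sketch below uses invented strings, including a hypothetical year-only value, and an explicit format argument; it is not a statement about what the dataset actually contains.

```r
# Made-up sample; the last value is a hypothetical year-only release date.
raw_dates <- c("2019-06-14", "2019-12-13", "2012")

# With an explicit format, values that do not match are returned as NA.
parsed <- as.Date(raw_dates, format = "%Y-%m-%d")

# Count and list the values that failed to parse.
failed   <- is.na(parsed) & !is.na(raw_dates)
n_failed <- sum(failed)

print(n_failed)
print(raw_dates[failed])
```

If such values turn up, they could be completed (for example by appending a month and day) or excluded from date-based analysis, depending on how precise the trend analysis needs to be.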
Step 3: Splitting the track_album_release_date into day, month and year
tibble(spotify_data$track_album_release_date)
## # A tibble: 32,828 x 1
## `spotify_data$track_album_release_date`
## <date>
## 1 2019-06-14
## 2 2019-12-13
## 3 2019-07-05
## 4 2019-07-19
## 5 2019-03-05
## 6 2019-07-11
## 7 2019-07-26
## 8 2019-08-29
## 9 2019-06-14
## 10 2019-06-20
## # ... with 32,818 more rows
spotify_data <- spotify_data %>%
  dplyr::mutate(
    year  = lubridate::year(track_album_release_date),
    month = lubridate::month(track_album_release_date),
    day   = lubridate::day(track_album_release_date)
  )
We aim to analyze how the data trends by artist and genre over the years in which the tracks were released. We therefore split the track_album_release_date into year, month and day.
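As an illustration of the kind of trend analysis the year column enables, the sketch below computes the mean popularity per release year in base R. The dates and popularity values are invented, and format() is used here only as a base R stand-in for lubridate::year().

```r
# Toy data: dates and popularity scores are made up for illustration.
tracks <- data.frame(
  release_date     = as.Date(c("2018-01-05", "2018-07-20", "2019-03-01",
                               "2019-06-14", "2019-12-13")),
  track_popularity = c(40, 60, 50, 70, 90)
)

# Base R equivalent of extracting the year from a Date.
tracks$year <- as.integer(format(tracks$release_date, "%Y"))

# Mean popularity per release year.
yearly <- aggregate(track_popularity ~ year, data = tracks, FUN = mean)
print(yearly)
```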
Step 4: Selecting the required columns from the dataset
spotify_data <- select(spotify_data,track_artist,year,track_album_name,track_name,track_popularity,everything(),-c(track_id,track_album_id,playlist_id,track_album_release_date,month,day))
We selected all the variables except track_id, track_album_id, playlist_id, track_album_release_date, month and day, and rearranged the selected columns in a more comprehensible order.
The data preview below shows the first 100 rows of the cleaned data.
output_data <- head(spotify_data, n = 100)
datatable(output_data, filter = 'top', options = list(pageLength = 25))
Below is a table containing the name, data type and description of each variable in the Spotify songs dataset.
variable_name <- colnames(spotify_data)
variable_descp <- c(
  "Song Artist",
  "Year when album released",
  "Song album name",
  "Song name",
  "Song Popularity (0-100) where higher is better",
  "Name of playlist",
  "Playlist genre",
  "Playlist subgenre",
  "Danceability describes how suitable a track is for dancing based on a combination of musical elements. A value of 0.0 is least danceable and 1.0 is most danceable",
  "Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity.",
  "The estimated overall key of the track",
  "The overall loudness of a track in decibels (dB)",
  "Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived",
  "Speechiness detects the presence of spoken words in a track.",
  "A confidence measure from 0.0 to 1.0 of whether the track is acoustic",
  "Predicts whether a track contains no vocals. ",
  "Detects the presence of an audience in the recording",
  "A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track",
  "The overall estimated tempo of a track in beats per minute (BPM)",
  "Duration of song in milliseconds"
)
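# Take the first class of each column as a plain character vector
col_types <- vapply(spotify_data, function(x) class(x)[1], character(1), USE.NAMES = FALSE)
# Combine name, type and description into one data frame
variable_summary <- data.frame(variable_name, col_types, variable_descp)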
colnames(variable_summary) <- c("Variable Name", "Data Type", "Description")
kable(variable_summary)
| Variable Name | Data Type | Description |
|---|---|---|
| track_artist | character | Song Artist |
| year | numeric | Year when album released |
| track_album_name | character | Song album name |
| track_name | character | Song name |
| track_popularity | integer | Song Popularity (0-100) where higher is better |
| playlist_name | character | Name of playlist |
| playlist_genre | character | Playlist genre |
| playlist_subgenre | character | Playlist subgenre |
| danceability | numeric | Danceability describes how suitable a track is for dancing based on a combination of musical elements. A value of 0.0 is least danceable and 1.0 is most danceable |
| energy | numeric | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. |
| key | integer | The estimated overall key of the track |
| loudness | numeric | The overall loudness of a track in decibels (dB) |
| mode | integer | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived |
| speechiness | numeric | Speechiness detects the presence of spoken words in a track. |
| acousticness | numeric | A confidence measure from 0.0 to 1.0 of whether the track is acoustic |
| instrumentalness | numeric | Predicts whether a track contains no vocals. |
| liveness | numeric | Detects the presence of an audience in the recording |
| valence | numeric | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track |
| tempo | numeric | The overall estimated tempo of a track in beats per minute (BPM) |
| duration_ms | integer | Duration of song in milliseconds |
We aim to study the distributions of the audio features and their correlation with track popularity and genre classification. The 12 audio features are used to classify the songs into genres; to do so, we would use boxplots and histograms to study the distribution of each variable. We would also use density plots to observe and compare the 12 audio features across genres.
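Before plotting, a quick numeric look at the correlations can show which features are worth examining. The sketch below uses synthetic data in which popularity is deliberately constructed to depend on danceability and not on energy; the values are simulated for illustration and say nothing about the real dataset.

```r
# Synthetic data: simulated so that popularity depends on danceability only.
set.seed(42)
n <- 200
danceability     <- runif(n)
energy           <- runif(n)
track_popularity <- 50 + 20 * danceability + rnorm(n, sd = 5)

features <- data.frame(danceability, energy, track_popularity)

# Pearson correlation of every column with track_popularity.
cors <- cor(features)[, "track_popularity"]
print(round(cors, 2))
```

On the real data, the same call could be applied to the numeric audio feature columns, with the usual caveat that correlation captures only linear association.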
Data visualization would use the ggplot2 package. We would also build a model to predict the popularity of a track from the audio features that appear significant in our exploratory data analysis.
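The kind of model is not specified here. As one possible starting point, the sketch below fits an ordinary linear regression with lm() on simulated data and evaluates it on a held-out split. The choice of linear regression, the two predictors and the 80/20 split are all assumptions made for this illustration.

```r
# Synthetic data: simulated for illustration, not taken from the dataset.
set.seed(1)
n <- 300
toy <- data.frame(
  danceability = runif(n),
  loudness     = runif(n, min = -20, max = 0)
)
toy$track_popularity <- 30 + 25 * toy$danceability + 1.0 * toy$loudness +
  rnorm(n, sd = 8)

# Assumed 80/20 train/test split.
train_idx <- sample(n, size = 0.8 * n)
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Assumed model: linear regression of popularity on two audio features.
fit <- lm(track_popularity ~ danceability + loudness, data = train)

# Root mean squared error on the held-out rows.
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$track_popularity - pred)^2))

print(coef(fit))
print(rmse)
```

Because track_popularity is bounded between 0 and 100, a plain linear model can predict values outside that range, so its output would need to be interpreted with care.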