The world of media arts consumption has gone through dramatic changes over the past 30 years. In this era of instant, boundless access to content, everything from the projects that get produced, to the way they are produced, to the user experience has undergone an astounding metamorphosis. The media companies that have risen to the top and stayed there are those that understood early on that user attraction and retention is the single most important pillar of success.
Netflix provides a perfect illustration of this. The company has consistently prioritized refining its original content creation process and its sophisticated recommendation algorithms, even as naysayers resoundingly predicted its demise. After 23 years of consistent quarterly debt raising, negative net income, and poor financial health ratios, the great majority of analysts believed Netflix was in severe crash-and-burn trouble. This year, Netflix made an announcement with major implications for the media industry: it no longer requires further rounds of borrowed capital to keep operations going. This means that, should everything continue as it has, the company is in an excellent position to service its debt, fund its extravagant productions, AND start building tangible profit. The “risk” of putting the user experience first for so many years has literally paid off, and Netflix has taken its position as the streaming audiovisual content powerhouse because of it.
Inspired by this, we have chosen to examine Spotify’s data repository, exploring the song attributes Spotify captures and constructing a scoring measure for grouping songs together as a base for song recommendations.
The relevance of this exploration is two-fold. Spotify’s user experience depends heavily on song curation and on opening multiple avenues through which users can discover new music they will actually like. In that sense, a deep dive into the trends and characteristics that associate certain songs with others will propel Spotify’s ability to generate accurate music recommendations and streamline the decision of where to surface new songs as they are added to the platform.
On the other hand, this study offers valuable insight to those on the music creation side. The plethora of avenues for posting and finding new music leaves up-and-coming artists with little choice but to pay others to devise a marketing strategy, or otherwise spend years without a reliable way to gain listeners. Our study provides a simplified approach to remedy that: a scoring system that lets anyone understand the musical characteristics a given song boils down to, and the type of music associated with those base attributes. A creator can capitalize on this information to find the venues and platforms where associated songs have a presence and take a share of those markets, or to understand how to shape their content to achieve a desired musical association. Insights of this kind are usually guarded by label executives; this study enables the artist to take back some of that control.
The Spotify Web API provides artist, album, and track data, as well as audio features and analysis, all easily accessible via the R package spotifyr.
It’s likely that Spotify uses these features to power products like Spotify Radio and custom playlists like Discover Weekly and Daily Mixes. Spotify has the benefit of letting humans create relationships between songs and weigh in on genre via listening and creating playlists.
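As a quick sketch of that access path (the client ID/secret below are placeholders you would obtain from the Spotify developer dashboard; the track ID is one from our dataset):
library(spotifyr)
#Placeholder credentials from the Spotify developer dashboard
Sys.setenv(SPOTIFY_CLIENT_ID = "your-client-id")
Sys.setenv(SPOTIFY_CLIENT_SECRET = "your-client-secret")
access_token <- get_spotify_access_token()
#Audio features (danceability, energy, valence, ...) for a single track
features <- get_track_audio_features("6f807x0ima9a1j3VPbc7VN")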
library(knitr)      #Used to create a document that mixes text and chunks of code
library(kableExtra) #Used to enhance the aesthetics of table outputs in the document
library(tidyverse)  #Used for faster and easier data manipulation and wrangling
library(dplyr)      #Attached with tidyverse; loaded explicitly for clarity
There are 12 audio features for each track, including confidence measures like acousticness, liveness, speechiness, and instrumentalness; perceptual measures like energy, loudness, danceability, and valence (positiveness); and descriptors like duration, tempo, key, and mode.
A brief description of the variables is given below:
url <- 'https://raw.githubusercontent.com/vpcincin/DataWrangling/main/Data_Dictionary.csv'
spotify_dictionary <- readr::read_csv(url)
kable(spotify_dictionary, format = "simple")
variable | class | description |
---|---|---|
track_id | character | Song unique ID |
track_name | character | Song Name |
track_artist | character | Song Artist |
track_popularity | double | Song Popularity (0-100) where higher is better |
track_album_id | character | Album unique ID |
track_album_name | character | Song album name |
track_album_release_date | character | Date when album released |
playlist_name | character | Name of playlist |
playlist_id | character | Playlist ID |
playlist_genre | character | Playlist genre |
playlist_subgenre | character | Playlist subgenre |
danceability | double | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
energy | double | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
key | double | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. |
loudness | double | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB. |
mode | double | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. |
speechiness | double | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
acousticness | double | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
instrumentalness | double | Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
liveness | double | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
valence | double | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
tempo | double | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
duration_ms | double | Duration of song in milliseconds |
Read the data into a dataframe.
#Creating the "spotify_songs" dataframe below by reading the data from github
spotify_songs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
The data was scraped for 2020-01-21, i.e. week 4 of 2020.
The dataset has 32833 records at a Track-Genre-Artist level and 23 variables.
dim(spotify_songs)
## [1] 32833 23
Surprisingly, there are only 15 null values in the 32833 × 23 dataframe, which is remarkable considering it is a real dataset. Deep-diving into the null values across columns, we see that track_artist, track_name, and track_album_name have 5 null values each.
#getting the count of total null values in data
sum(is.na(spotify_songs))
## [1] 15
#getting null values by columns
colSums(is.na(spotify_songs))
## track_id track_name track_artist
## 0 5 5
## track_popularity track_album_id track_album_name
## 0 0 5
## track_album_release_date playlist_name playlist_id
## 0 0 0
## playlist_genre playlist_subgenre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
We looked at the names of all columns to check whether they are intuitive and follow a uniform naming convention commonly used in R. We decided to use snake case throughout the project.
#printing variable names
names(spotify_songs)
## [1] "track_id" "track_name"
## [3] "track_artist" "track_popularity"
## [5] "track_album_id" "track_album_name"
## [7] "track_album_release_date" "playlist_name"
## [9] "playlist_id" "playlist_genre"
## [11] "playlist_subgenre" "danceability"
## [13] "energy" "key"
## [15] "loudness" "mode"
## [17] "speechiness" "acousticness"
## [19] "instrumentalness" "liveness"
## [21] "valence" "tempo"
## [23] "duration_ms"
The variable names look consistent and easy to interpret, so no renaming is needed.
It is vital to have the correct data type for each column prior to any analysis. Hence, we used str() to observe the data type of each column and changed it wherever necessary. Below are the observations:
#checking variable types for consistency
str(spotify_songs)
## tibble [32,833 x 23] (S3: tbl_df/tbl/data.frame)
## $ track_id : chr [1:32833] "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
## $ track_name : chr [1:32833] "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : chr [1:32833] "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_popularity : num [1:32833] 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : chr [1:32833] "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
## $ track_album_name : chr [1:32833] "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
## $ track_album_release_date: chr [1:32833] "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
## $ playlist_name : chr [1:32833] "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
## $ playlist_id : chr [1:32833] "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
## $ playlist_genre : chr [1:32833] "pop" "pop" "pop" "pop" ...
## $ playlist_subgenre : chr [1:32833] "dance pop" "dance pop" "dance pop" "dance pop" ...
## $ danceability : num [1:32833] 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num [1:32833] 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : num [1:32833] 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num [1:32833] -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : num [1:32833] 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num [1:32833] 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num [1:32833] 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num [1:32833] 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num [1:32833] 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num [1:32833] 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num [1:32833] 122 100 124 122 124 ...
## $ duration_ms : num [1:32833] 194754 162600 176616 169093 189052 ...
Observations:
- mode currently has a numeric type; however, it is a factor/Boolean variable, as it only takes the values {0, 1}.
- track_album_release_date is currently a character column, but it actually holds date values. It is vital to change this, as we will need the column in date format for analyses such as time series plots, Y-o-Y growth, etc.
#Modifying data types
spotify_songs$mode <- as.factor(spotify_songs$mode)
spotify_songs$track_album_release_date <- as.Date(spotify_songs$track_album_release_date)
We can confirm that the data type conversions are now reflected in the data.
#QC step
class(spotify_songs$mode)
## [1] "factor"
class(spotify_songs$track_album_release_date)
## [1] "Date"
We had earlier seen 5 missing values each in track_artist, track_album_name, and track_name. We impute these with the character constant ‘unknown’. These missing values do not pose a serious threat to any analysis we expect to perform, for two primary reasons: a) they are a small fraction of our dataset, and b) we still have plenty of information for these records that we can use in our EDA.
#Missing Value Treatment
spotify_songs$track_artist[is.na(spotify_songs$track_artist)] <- 'unknown'
spotify_songs$track_album_name[is.na(spotify_songs$track_album_name)] <- 'unknown'
spotify_songs$track_name[is.na(spotify_songs$track_name)] <- 'unknown'
The three character columns no longer contain missing values. However, the QC below still reports 1886 NAs: these were most likely introduced by the as.Date() conversion above, since as.Date() returns NA for release dates recorded with only a year (e.g. "2019") rather than a full date. We keep this in mind for any date-based analysis.
#QC step
sum(is.na(spotify_songs))
## [1] 1886
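As a quick follow-up check (a sketch; output not shown here), we can confirm which column holds the remaining NAs:
#Columns that still contain NAs (expected: the converted date column)
na_counts <- colSums(is.na(spotify_songs))
na_counts[na_counts > 0]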
Looking at the summary of the numeric data provides us a high-level understanding of data distribution and centrality. While glancing through the summary, we can also quickly get an idea about which columns to thoroughly inspect for outliers.
#Generating summary
summary(select_if(spotify_songs,is.numeric))
## track_popularity danceability energy key
## Min. : 0.00 Min. :0.0000 Min. :0.000175 Min. : 0.000
## 1st Qu.: 24.00 1st Qu.:0.5630 1st Qu.:0.581000 1st Qu.: 2.000
## Median : 45.00 Median :0.6720 Median :0.721000 Median : 6.000
## Mean : 42.48 Mean :0.6548 Mean :0.698619 Mean : 5.374
## 3rd Qu.: 62.00 3rd Qu.:0.7610 3rd Qu.:0.840000 3rd Qu.: 9.000
## Max. :100.00 Max. :0.9830 Max. :1.000000 Max. :11.000
## loudness speechiness acousticness instrumentalness
## Min. :-46.448 Min. :0.0000 Min. :0.0000 Min. :0.0000000
## 1st Qu.: -8.171 1st Qu.:0.0410 1st Qu.:0.0151 1st Qu.:0.0000000
## Median : -6.166 Median :0.0625 Median :0.0804 Median :0.0000161
## Mean : -6.720 Mean :0.1071 Mean :0.1753 Mean :0.0847472
## 3rd Qu.: -4.645 3rd Qu.:0.1320 3rd Qu.:0.2550 3rd Qu.:0.0048300
## Max. : 1.275 Max. :0.9180 Max. :0.9940 Max. :0.9940000
## liveness valence tempo duration_ms
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 4000
## 1st Qu.:0.0927 1st Qu.:0.3310 1st Qu.: 99.96 1st Qu.:187819
## Median :0.1270 Median :0.5120 Median :121.98 Median :216000
## Mean :0.1902 Mean :0.5106 Mean :120.88 Mean :225800
## 3rd Qu.:0.2480 3rd Qu.:0.6930 3rd Qu.:133.92 3rd Qu.:253585
## Max. :0.9960 Max. :0.9910 Max. :239.44 Max. :517810
From the descriptive statistics of the numeric variables obtained above, we see that for some variables the mean is not very close to the median, which indicates skewness in the data and hints at potential outliers. The direction of the skew also tells us where the outliers lie: right skewness (mean > median) suggests outliers toward the upper bound, while left skewness (mean < median) suggests outliers toward the lower bound.
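As a quick sketch of this idea, we can rank the numeric variables by the gap between their mean and median (the helper below is illustrative, not part of the original pipeline):
#Ranking numeric variables by the gap between mean and median
num_cols <- select_if(spotify_songs, is.numeric)
skew_check <- data.frame(
  mean   = sapply(num_cols, mean,   na.rm = TRUE),
  median = sapply(num_cols, median, na.rm = TRUE)
)
skew_check$gap <- skew_check$mean - skew_check$median
skew_check[order(-abs(skew_check$gap)), ]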
To further check whether these variables have outliers, we plot their distributions using boxplots (in the visual summary section).
For the character variables, we explore the number of levels/distinct values in each variable.
#checking levels for character variables
ulst <- lapply(select_if(spotify_songs, is.character), unique)
k <- lengths(ulst)
k
## track_id track_name track_artist track_album_id
## 28356 23450 10693 22545
## track_album_name playlist_name playlist_id playlist_genre
## 19744 449 471 6
## playlist_subgenre
## 24
Concerns from the observation above: there are only 28356 unique track_id values across 32833 records, so the same track can appear in multiple playlists (and hence in multiple genres/subgenres); likewise, there are 471 playlist_id values but only 449 playlist_name values, so playlist names are not unique. We will need to account for these duplicates when aggregating.
Visual Summary
Box plots will be helpful for outlier detection. In the analysis above, we observed that a few columns have the mean pulled toward one side due to outliers or skewness. Here we check the boxplots of these variables to identify outliers and subsequently devise a strategy to treat them.
#Generating boxplots
boxplot(spotify_songs$danceability, main = 'Boxplot distribution of Danceability')
boxplot(spotify_songs$loudness, main = 'Boxplot distribution of loudness')
boxplot(spotify_songs$tempo , main = 'Boxplot distribution of tempo')
From the boxplot distributions we see that danceability has a single value at 0 that stands apart from the rest of the distribution. Similarly, loudness has one very low value (about -46 dB), and tempo has one value far above and one far below the majority of data points.
We can remove these records to trim the tails of these variables. Since only a couple of records are involved, the data loss is negligible; it is safe to remove them and visualize the distributions again to see the change.
#Trimming Outliers
new_df <- subset(spotify_songs,
                 danceability > min(danceability) &
                 loudness > min(loudness) &
                 tempo > min(tempo) & tempo < max(tempo))
#Visualizing the distributions again
boxplot(new_df$danceability, main = 'Distribution of Danceability')
boxplot(new_df$loudness, main = 'Distribution of loudness')
boxplot(new_df$tempo , main = 'Distribution of tempo')
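A more systematic alternative (a sketch, not applied here) is the 1.5 × IQR fence that boxplot() itself uses to flag points:
#Computing the 1.5 * IQR fences used by boxplot() to flag outliers
iqr_fences <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  unname(q + c(-1.5, 1.5) * IQR(x, na.rm = TRUE))
}
iqr_fences(new_df$loudness)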
Thus, in data cleaning, we checked the variable types, imputed the missing values, examined the numerical summaries, and detected and treated the outliers.
The below table shows a glimpse of the final cleaned dataset.
#printing head
knitr::kable(head(new_df, 5), "simple")
track_id | track_name | track_artist | track_popularity | track_album_id | track_album_name | track_album_release_date | playlist_name | playlist_id | playlist_genre | playlist_subgenre | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6f807x0ima9a1j3VPbc7VN | I Don’t Care (with Justin Bieber) - Loud Luxury Remix | Ed Sheeran | 66 | 2oCs0DGTsRO98Gh5ZSl2Cx | I Don’t Care (with Justin Bieber) [Loud Luxury Remix] | 2019-06-14 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.748 | 0.916 | 6 | -2.634 | 1 | 0.0583 | 0.1020 | 0.00e+00 | 0.0653 | 0.518 | 122.036 | 194754 |
0r7CVbZTWZgbTCYdfa2P31 | Memories - Dillon Francis Remix | Maroon 5 | 67 | 63rPSO264uRjW1X5E6cWv6 | Memories (Dillon Francis Remix) | 2019-12-13 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.726 | 0.815 | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 4.21e-03 | 0.3570 | 0.693 | 99.972 | 162600 |
1z1Hg7Vb0AhHDiEmnDE79l | All the Time - Don Diablo Remix | Zara Larsson | 70 | 1HoSmj2eLcsrR0vE9gThr4 | All the Time (Don Diablo Remix) | 2019-07-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.675 | 0.931 | 1 | -3.432 | 0 | 0.0742 | 0.0794 | 2.33e-05 | 0.1100 | 0.613 | 124.008 | 176616 |
75FpbthrwQmzHlBJLuGdC7 | Call You Mine - Keanu Silva Remix | The Chainsmokers | 60 | 1nqYsOef1yKKuGOVchbsk6 | Call You Mine - The Remixes | 2019-07-19 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.718 | 0.930 | 7 | -3.778 | 1 | 0.1020 | 0.0287 | 9.40e-06 | 0.2040 | 0.277 | 121.956 | 169093 |
1e8PAfcKUYoKkxPhrHqw4x | Someone You Loved - Future Humans Remix | Lewis Capaldi | 69 | 7m7vv9wlQ4i0LFuJiE2zsQ | Someone You Loved (Future Humans Remix) | 2019-03-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.650 | 0.833 | 1 | -4.672 | 1 | 0.0359 | 0.0803 | 0.00e+00 | 0.0833 | 0.725 | 123.976 | 189052 |
Relevant Variables:
Since we are trying to build a recommendation engine to suggest similar songs to the Spotify user, we need track attributes to characterize each track. Below are a few potential variables that could be important for generating a similarity score for the tracks:
track_artist, track_popularity, track_album_name, playlist_genre, playlist_subgenre, danceability, loudness, acousticness, tempo, liveness, duration_ms
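As a small sketch, these candidate variables can be pulled into their own view for later similarity scoring (the name similarity_vars is illustrative):
#Candidate variables for the similarity score
similarity_vars <- new_df %>%
  select(track_artist, track_popularity, track_album_name,
         playlist_genre, playlist_subgenre, danceability,
         loudness, acousticness, tempo, liveness, duration_ms)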
We plot the distributions of the variables of interest for our analysis.
#histogram for variables of interest
hist(new_df$track_popularity, main = 'Distribution of Track Popularity', xlab = 'Track popularity', col = 'light blue')
hist(new_df$danceability, main = 'Distribution of Danceability', xlab = 'Danceability', col = 'light blue')
hist(new_df$loudness, main = 'Distribution of loudness', xlab = 'loudness', col = 'light blue')
The data has track duration in milliseconds, which is too granular for us. Hence, we convert it to minutes and then plot the distribution. As expected, most tracks are between 3 and 5 minutes.
#converting the song duration in minutes
new_df$duration_min <- new_df$duration_ms / 60000
hist(new_df$duration_min, main = 'Distribution of duration in mins', xlab = 'Duration (minutes)', col = 'light blue')
However, we still have a few concerns with the data (noted in the character-variable exploration above) that we need to investigate prior to analysis.
The Spotify data is a rich dataset with playlist information for tracks ranging from 1905 to 2020. To deep dive into the data and look for interesting trends, we can aggregate it at different levels, which gives us a feel for the data distribution at each level. For our EDA, we will consider the following cuts/levels, which may help us answer all or some of the following brainstormed business questions -
To answer all or some of the questions above, we must slice and dice the data and look at it from multiple dimensions. The dataset is at a Track-Artist-Genre level and has various metrics for each track. However, to track how well a certain song did for an artist, we need to compare its performance to other tracks by the same artist/genre – a ‘Relative Metric’. For this, we plan to create aggregated views of the data at the Artist and Genre level (see the sketch after the formulas below) with fields like -:
The following shows a glimpse of the types of views we are considering to answer the key business questions -
We further intend to engineer features for relative track performance by joining the aggregated views with the data set:
Artist_Relative_[Metric] = Metric / Artist_Average_Metric
Genre_Relative_[Metric] = Metric / Genre_Average_Metric
We will add these metrics to the EDA mentioned above to get a better relative understanding of the track and its performance.
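As a hedged sketch of these views and metrics (field names such as artist_avg_popularity and artist_view are illustrative, not final):
#Illustrative artist-level view; the genre-level view follows the same pattern
artist_view <- new_df %>%
  group_by(track_artist) %>%
  summarise(artist_avg_popularity   = mean(track_popularity),
            artist_avg_danceability = mean(danceability),
            artist_track_count      = n())
#Joining back and computing a relative metric, as in the formulas above
new_df_rel <- new_df %>%
  left_join(artist_view, by = "track_artist") %>%
  mutate(artist_relative_popularity = track_popularity / artist_avg_popularity)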
We plan to create a recommendation engine using this dataset. The engine would take as input a list of tracks a user listens to and leverage it to predict ‘similar’ songs the user might like.
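A minimal sketch of the similarity idea (not the final engine; the feature list and the helper recommend_similar() are illustrative assumptions): scale the audio features, then rank tracks by Euclidean distance to a seed track.
#Scale the audio features so no single feature dominates the distance
feature_cols <- c("danceability", "energy", "loudness", "speechiness",
                  "acousticness", "instrumentalness", "liveness", "valence", "tempo")
feat <- scale(new_df[, feature_cols])
recommend_similar <- function(seed_row, n = 5) {
  #Euclidean distance from the seed track to every track
  d <- sqrt(colSums((t(feat) - feat[seed_row, ])^2))
  d[seed_row] <- Inf   #exclude the seed itself
  new_df[order(d)[1:n], c("track_name", "track_artist")]
}
recommend_similar(1)   #five tracks most similar to the first track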