1. Introduction
Spotify is a digital podcast and music streaming service with over 248 million monthly active users across the globe. The company's paid service, ‘Spotify Premium’, is currently growing its user base at a rate of 31% year on year. While its features lead an average user to spend 25 hours per month on the service, the data behind the scenes is equally interesting to dive into and learn from. This ‘king of music streaming’ is widely recognized for its personalized music recommendations, and the following analyses look into the key determinants that influence track popularity on Spotify. The analyses are primarily designed to aid firms and music distributors that operate within the digital streaming services domain.
1.1 Problem Statement
There are three main objectives for this analysis.
Identify the general trends in music affinity of Spotify users
Determine key influencers of track popularity on Spotify (by album release era and genre)
Intelligently group music tracks based on their characteristics
1.2 Implementation and Techniques
The dataset contains information on the artist, genre, characteristics, and popularity of various tracks on Spotify. The cleaned data would be analyzed using EDA and data mining techniques (such as regression, word tokenization, and clustering) in order to meet the objectives listed in 1.1.
A regression approach such as Random Forest or Multiple Linear Regression would help identify variable importance in determining track popularity. Tokenization and NLP techniques would help identify whether a set of words (in the titles) tends to occur together in more popular tracks. Furthermore, clustering techniques can be employed to group tracks with similar characteristics.
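As a rough illustration of the regression idea above, the sketch below ranks predictors by the absolute size of their standardized coefficients in a multiple linear regression. The data is simulated, the column names only mirror the dataset, and standardized coefficients are just one of several possible importance measures; none of this is the final model.

```r
# Hypothetical toy data: column names mirror the dataset, values are simulated.
set.seed(42)
n <- 200
tracks <- data.frame(
  danceability = runif(n),
  energy       = runif(n),
  valence      = runif(n)
)
# Simulated popularity driven mostly by danceability (an assumption for illustration).
tracks$track_popularity <- 40 + 30 * tracks$danceability +
  5 * tracks$energy + rnorm(n, sd = 5)

# Standardize all columns so coefficient magnitudes are comparable.
scaled <- as.data.frame(scale(tracks))
fit <- lm(track_popularity ~ danceability + energy + valence, data = scaled)

# Rank predictors by absolute standardized coefficient (intercept dropped).
importance <- sort(abs(coef(fit)[-1]), decreasing = TRUE)
print(importance)
```

A Random Forest would replace the coefficient ranking with its own importance scores, but the workflow (fit, extract importance, rank) stays the same.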
1.3 Key Consumers of the Analysis
The analyses are primarily designed to aid firms and music distributors that operate within the digital streaming services domain. Identifying user trends would help digital music distributors to better streamline their music offerings. The analysis would also help artists to understand their target consumers better (as the analysis is split by music genre).
2. Packages Required
The following packages are required, listed with their uses:
tidytext = To convert text to and from tidy formats
DT = HTML display of the data
tidyverse = Allows data manipulation
stringr = Allows string operations
magrittr = Pipe operator in R programming
ggplot2 = For graphical representation in R
dplyr = For data manipulation in R
gridExtra = Allows grid formatting of ggplot figures
pracma = Allows advanced numerical analyses
treemap = Allows treemap visualizations
tm = For text mining
GGally = Allows in-depth EDA, works in synchrony with ggplot
randomForest = Creates random forest models
wordcloud = For word cloud generator
plotly = For creating interactive web-based graphs
###########################################
# Installing / Loading Necessary Packages #
###########################################
list_packages <- c("tidytext", "DT", "tidyverse", "stringr", "magrittr", "gridExtra",
"pracma", "treemap", "tm", "GGally", "randomForest", "wordcloud",
"plotly", "ggplot2", "dplyr", "data.table", "rmarkdown", "tinytex",
"knitr")
new_packages <- list_packages[!(list_packages %in% installed.packages()[,"Package"])]
if( length(new_packages) ) install.packages(new_packages)
lapply(list_packages, require, character.only = TRUE)
3. Data Preparation
3.1 Original Data Source
Original Data Source can be found here.
3.2 Explanation of Source Data
The data comes from Spotify and is sourced via the spotifyr package, authored by Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff. The main purpose of the package is to make it easier to obtain general metadata for songs from Spotify’s API. Because the package runs against the API, updated data can be collected through it. The source data has 32833 observations and 23 variables. The dataset contains 15 missing values, which have not been imputed in the original dataset. The variable ‘track_id’ is a unique song ID, yet 4477 of its values are duplicates, because one song can be associated with multiple genres in the Spotify dataset.
3.3 Data Importing and Cleaning
Step 1 : Converting blank cells to NA while importing the dataset
#################
# Download Data #
#################
# Read the file from the working directory; blank cells and "NA" strings become NA
spotify <- read.csv(file.path(getwd(), "Master.csv"), na.strings = c("", " ", "NA"))
Step 2 : Identifying Missing Values in the dataset
Missing values were counted for each variable, and 15 were identified in the dataset. The affected observations were removed, as they form a very small proportion of the dataset (at most 15 of 32833 rows, about 0.05%).
###################
# Preprocess Data #
###################
# Identifying columns with missing values
col_names <- colnames(spotify)
for (i in seq_along(col_names)){
  dat <- spotify[col_names[i]]
  print(paste("Number of missing values in column",col_names[i]," is : " ,sum(is.na(dat))))
}
## [1] "Number of missing values in column track_id is : 0"
## [1] "Number of missing values in column track_name is : 5"
## [1] "Number of missing values in column track_artist is : 5"
## [1] "Number of missing values in column track_popularity is : 0"
## [1] "Number of missing values in column track_album_id is : 0"
## [1] "Number of missing values in column track_album_name is : 5"
## [1] "Number of missing values in column track_album_release_date is : 0"
## [1] "Number of missing values in column playlist_name is : 0"
## [1] "Number of missing values in column playlist_id is : 0"
## [1] "Number of missing values in column playlist_genre is : 0"
## [1] "Number of missing values in column playlist_subgenre is : 0"
## [1] "Number of missing values in column danceability is : 0"
## [1] "Number of missing values in column energy is : 0"
## [1] "Number of missing values in column key is : 0"
## [1] "Number of missing values in column loudness is : 0"
## [1] "Number of missing values in column mode is : 0"
## [1] "Number of missing values in column speechiness is : 0"
## [1] "Number of missing values in column acousticness is : 0"
## [1] "Number of missing values in column instrumentalness is : 0"
## [1] "Number of missing values in column liveness is : 0"
## [1] "Number of missing values in column valence is : 0"
## [1] "Number of missing values in column tempo is : 0"
## [1] "Number of missing values in column duration_ms is : 0"
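The removal step described in Step 2 is not shown in the code above. A minimal sketch of how it could be done is given below, using a small hypothetical data frame (`toy`) in place of `spotify`; `complete.cases()` is one common choice, not necessarily the one used for the final dataset.

```r
# Hypothetical toy data frame standing in for `spotify`.
toy <- data.frame(
  track_id     = c("a1", "b2", "c3", "d4"),
  track_name   = c("song one", NA, "song three", "song four"),
  track_artist = c("artist x", NA, "artist y", "artist z"),
  stringsAsFactors = FALSE
)

# Count missing values per column in one call.
na_per_column <- colSums(is.na(toy))

# Keep only rows with no missing value in any column.
toy_clean <- toy[complete.cases(toy), ]

# Share of rows dropped, as a percentage.
pct_dropped <- 100 * (nrow(toy) - nrow(toy_clean)) / nrow(toy)
```

`colSums(is.na(...))` gives the same per-column counts as the loop above in a single named vector, which is convenient when the counts feed into later steps.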
Step 3 : Understanding preliminary data structure and performing necessary type conversions of variables
Variables that could support text mining were converted into character variables. Leading and trailing spaces were removed from these variables, and the characters were converted to lower case.
# identifying structure of the dataset
str(spotify)
# necessary type conversions
spotify$track_id <- tolower(trimws(as.character(spotify$track_id)))
spotify$track_name <- tolower(trimws(as.character(spotify$track_name)))
spotify$track_artist <- tolower(trimws(as.character(spotify$track_artist)))
spotify$track_album_id <- tolower(trimws(as.character(spotify$track_album_id)))
spotify$track_album_name <- tolower(trimws(as.character(spotify$track_album_name)))
spotify$playlist_name <- tolower(trimws(as.character(spotify$playlist_name)))
spotify$playlist_id <- tolower(trimws(as.character(spotify$playlist_id)))
spotify$playlist_genre <- tolower(trimws(as.character(spotify$playlist_genre)))
spotify$playlist_subgenre <- tolower(trimws(as.character(spotify$playlist_subgenre)))
Step 4 : Identifying duplicate observations if any
A user-defined function was created to remove duplicate observations from the dataset. No duplicate observations were found. The variable ‘track_id’, although a unique song identifier, has 4477 duplicate values because each song can be associated with multiple genres on Spotify. Hence, no manipulation was done on this variable, so as to retain the association between tracks and genres.
# identifying duplicate observations in the dataset
# user-defined function: drops fully duplicated rows and always returns the data
func_duplicate <- function(x){
  if (sum(duplicated(x)) > 0) {
    x <- x %>% distinct()
  } else {
    print("No duplicate observations found in the dataset")
  }
  x
}
spotify <- func_duplicate(spotify)
## [1] "No duplicate observations found in the dataset"
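The difference between duplicated rows and a repeated ‘track_id’ can be seen in the small sketch below. The data frame is hypothetical, and only the column names are taken from the dataset.

```r
# Hypothetical toy data: the same track can sit in playlists of several genres.
toy <- data.frame(
  track_id       = c("a1", "a1", "b2", "c3", "c3", "c3"),
  playlist_genre = c("pop", "r&b", "rock", "pop", "edm", "latin"),
  stringsAsFactors = FALSE
)

# Fully duplicated rows (what a distinct() call would remove).
n_dup_rows <- sum(duplicated(toy))

# Repeated track_id values beyond the first occurrence.
n_dup_ids <- sum(duplicated(toy$track_id))

# Number of distinct genres attached to each track.
genres_per_track <- tapply(toy$playlist_genre, toy$track_id,
                           function(g) length(unique(g)))
```

Here no row is a full duplicate, yet three ‘track_id’ values repeat, which is the same pattern reported for the real data.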
3.4 Cleaned Dataset
Please find below a sample from the cleaned dataset.
# Outputting Cleaned Data
knitr::kable( head(spotify,3), align = "lccrr", caption = "A sample data")
| track_id | track_name | track_artist | track_popularity | track_album_id | track_album_name | track_album_release_date | playlist_name | playlist_id | playlist_genre | playlist_subgenre | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6f807x0ima9a1j3vpbc7vn | i don’t care (with justin bieber) - loud luxury remix | ed sheeran | 66 | 2ocs0dgtsro98gh5zsl2cx | i don’t care (with justin bieber) [loud luxury remix] | 6/14/2019 | pop remix | 37i9dqzf1dxczdd7cfekhw | pop | dance pop | 0.748 | 0.916 | 6 | -2.634 | 1 | 0.0583 | 0.1020 | 0.00e+00 | 0.0653 | 0.518 | 122.036 | 194754 |
| 0r7cvbztwzgbtcydfa2p31 | memories - dillon francis remix | maroon 5 | 67 | 63rpso264urjw1x5e6cwv6 | memories (dillon francis remix) | 12/13/2019 | pop remix | 37i9dqzf1dxczdd7cfekhw | pop | dance pop | 0.726 | 0.815 | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 4.21e-03 | 0.3570 | 0.693 | 99.972 | 162600 |
| 1z1hg7vb0ahhdiemnde79l | all the time - don diablo remix | zara larsson | 70 | 1hosmj2elcsrr0ve9gthr4 | all the time (don diablo remix) | 7/5/2019 | pop remix | 37i9dqzf1dxczdd7cfekhw | pop | dance pop | 0.675 | 0.931 | 1 | -3.432 | 0 | 0.0742 | 0.0794 | 2.33e-05 | 0.1100 | 0.613 | 124.008 | 176616 |
3.5 Summary of the Variables
Below is a summary of the variables relevant to the analysis. The primary source of the data description can also be found here. “track_popularity” is one of the main variables of interest, and its summary statistics are provided after the table. No variables are removed in the EDA process, hence a description of every variable is provided.
# reading data dictionary file
a <- read.csv(paste(getwd(),'/dat_dict.csv',sep=''))
knitr::kable(a, align = "lccrr")
| variable | class | description |
|---|---|---|
| track_id | character | Song Unique ID |
| track_name | character | Song Name |
| track_artist | character | Song Artist |
| track_popularity | double | Song Popularity (0-100) where higher is better |
| track_album_id | character | Album unique ID |
| track_album_name | character | Song album name |
| track_album_release_date | Factor | Date when album released |
| playlist_name | character | Name of playlist |
| playlist_genre | character | Playlist genre |
| playlist_subgenre | character | Playlist subgenre |
| danceability | num | Danceability describes how suitable a track is for dancing. A value of 0.0 is least danceable and 1.0 is most danceable. |
| energy | num | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. |
| key | int | The estimated overall key of the track. |
| loudness | num | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. |
| mode | int | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. |
| speechiness | num | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
| acousticness | num | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
| instrumentalness | num | Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
| liveness | num | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
| valence | num | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
| tempo | num | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
| duration_ms | num | Duration of song in milliseconds |
Summary statistics of track_popularity:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 24.00 45.00 42.48 62.00 100.00
4. Proposed Exploratory Data Analysis
This section contains an outline for the EDA plan (for the final project).
4.1 EDA plan summary
Exploratory Data Analysis would consist of the following steps:
Understanding trends such as count of artists and tracks by genre - This is to understand if the songs available on Spotify are skewed in any manner
Feature engineering - Creating new variables to classify the era of album release. This would help identify trends by album era. An analysis would also be done to understand whether the factors that affect track popularity vary by album release era
Understanding the distribution of track_popularity across various genres using kernel density plots:
- Is the popularity distribution of a genre more tightly concentrated around its mean than that of other genres? This would suggest that tracks of this genre have a higher probability (for the same confidence interval) of being popular amongst Spotify users
Understanding Spotify user base through correlation factor analysis:
- Are danceable tracks more popular on Spotify?
- Are tracks conveying higher positivity (high valence) more popular?
- Do users rate audiobooks / poetry / podcasts higher than songs on Spotify (analysis using the speechiness variable)?
Do we have any common trends in highly popular/ less popular artists on Spotify?
- The common music keys used by highly popular album artists (analyzed using the key variable)
- Bag of words analysis of music titles / podcast titles within highly popular tracks (is this different from the bag of words obtained for less popular tracks?)
Have the characteristics of a music genre changed over time?
- Understanding changes in key variables (such as danceability, key, loudness) of a genre over different eras of album release
Analysing the presence of ‘Network Effect’ on the popularity of artists:
- Do highly popular artists have a significantly higher number of tracks than others? If the difference is significant, we can further investigate whether the higher number of tracks increases the visibility of an artist and hence improves their overall popularity
Chi-square test to understand the significance of various variables in determining track popularity
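Two of the steps above, the era feature and the chi-square test, can be sketched together as follows. This is only an illustration on simulated data: the m/d/Y date format is assumed from the sample rows in section 3.4 (dates in any other format would parse to NA), and the era cut points and the median split of popularity are assumptions, not final choices.

```r
set.seed(1)
n <- 400
# Hypothetical release dates in the m/d/Y format seen in the sample rows.
years <- sample(1970:2019, n, replace = TRUE)
toy <- data.frame(
  track_album_release_date = paste0("1/1/", years),
  track_popularity = sample(0:100, n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Parse the date and extract the release year.
release_date <- as.Date(toy$track_album_release_date, format = "%m/%d/%Y")
release_year <- as.integer(format(release_date, "%Y"))

# Bucket years into eras; the cut points are illustrative assumptions.
toy$era <- cut(release_year,
               breaks = c(1969, 1989, 2009, 2019),
               labels = c("1970-1989", "1990-2009", "2010-2019"))

# Band popularity into low / high around the median (an assumed split).
toy$popularity_band <- ifelse(toy$track_popularity >= median(toy$track_popularity),
                              "high", "low")

# Chi-square test of independence between era and popularity band.
contingency <- table(toy$era, toy$popularity_band)
test_result <- chisq.test(contingency)
print(test_result$p.value)
```

Because popularity is simulated independently of the release year here, no association is built into the data; on the real dataset a small p-value would indicate that the popularity band depends on the era.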
4.2 Types of Plots and Tables
For the EDA steps in section 4.1 the following plot types and tables would be used:
Kernel Density Plots (to visualize distribution of a numeric variable grouped by a categorical variable)
Lollipop Charts (to visualize category wise relative ranking in an analysis)
Bar Charts (to visualize category wise relative ranking in an analysis)
Connected scatter plots (to visualize change in popularity of a genre by release date)
Tree map (to visualize the major categories based on a numeric variable)
Data Tables (to represent bag of words analyses)
Correlation matrix tables (to visualize correlation factors in an n*n matrix format)
Boxplots (to visualize distribution of a numeric variable)
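As an example of the first plot type, the sketch below overlays kernel density estimates of popularity for two genres and compares their spread. The data is simulated and base R graphics are used only to keep the example self-contained; the final plots would more likely be built with ggplot2.

```r
set.seed(7)
# Simulated popularity scores for two hypothetical genres.
toy <- data.frame(
  playlist_genre   = rep(c("pop", "rock"), each = 500),
  track_popularity = c(rnorm(500, mean = 50, sd = 5),
                       rnorm(500, mean = 50, sd = 20))
)

# One kernel density estimate per genre.
densities <- lapply(split(toy$track_popularity, toy$playlist_genre), density)

# Spread of each genre around its own mean.
spread <- tapply(toy$track_popularity, toy$playlist_genre, sd)

# Overlay the density curves on a common axis.
plot(densities$rock, main = "Popularity by genre (simulated)",
     xlab = "track_popularity", ylim = c(0, max(densities$pop$y)))
lines(densities$pop, lty = 2)
legend("topright", legend = c("rock", "pop"), lty = c(1, 2))
```

The genre with the smaller standard deviation shows the taller, narrower curve, which is the pattern the density comparison in section 4.1 looks for.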
4.3 Techniques to be Learnt
I would need to improve my skills in the Natural Language Processing arena of analytics, to deliver more impactful and insightful analyses from the dataset.
4.4 Data Mining Techniques to be Incorporated
Regression techniques to identify key factors that determine track popularity: The analysis would include regression techniques such as Random Forests, Multiple Linear Regression, and Boosting (selected based on best prediction performance)
Text mining techniques: Additionally, Natural Language Processing techniques such as word tokenization, word clouds, and TF-IDF would be employed to understand the common song title words amongst highly popular tracks
Clustering techniques: Furthermore, unsupervised learning techniques such as clustering would be employed to intelligently group tracks having similar characteristics.
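The clustering step could look roughly like the sketch below, which applies k-means to scaled audio features. The data is simulated, k-means is only one candidate among clustering techniques, and k = 2 is chosen to match the simulated groups rather than tuned.

```r
set.seed(123)
# Two simulated groups of tracks with different audio profiles.
toy <- data.frame(
  danceability = c(rnorm(50, 0.8, 0.05), rnorm(50, 0.3, 0.05)),
  energy       = c(rnorm(50, 0.9, 0.05), rnorm(50, 0.2, 0.05)),
  tempo        = c(rnorm(50, 125, 5),    rnorm(50, 80, 5))
)

# Scale features so tempo (BPM) does not dominate the 0-1 variables.
features <- scale(toy)

# k = 2 is an assumption matching the simulated data, not a tuned choice.
clusters <- kmeans(features, centers = 2, nstart = 10)

# Cluster sizes and mean profile per cluster on the original scale.
print(table(clusters$cluster))
print(aggregate(toy, by = list(cluster = clusters$cluster), FUN = mean))
```

Scaling matters here because tempo is measured in BPM while most other features lie between 0 and 1; without it, distances would be driven almost entirely by tempo.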