1. Introduction

Spotify is a digital podcast and music streaming service with over 248 million monthly active users across the globe. The company's paid service, known as ‘Spotify Premium’, has a user base growing at a staggering 31% year on year. While its features keep the average user on the service for 25 hours per month, the data behind the scenes is equally interesting to dive into and learn from. This ‘king of music streaming’ is widely recognized for its personalized music recommendations, and the following analyses look into the key determinants of track popularity on Spotify. The analyses are primarily designed to aid firms and music distributors that operate within the digital streaming services domain.

1.1 Problem Statement

There are three main objectives for this analysis.

  1. Identify the general trends in music affinity of Spotify users

  2. Determine key influencers of track popularity on Spotify (by album release era and genre)

  3. Intelligently group music tracks based on their characteristics

1.2 Implementation and Techniques

The dataset contains information on the artist, genre, characteristics, and popularity of various tracks on Spotify. The cleaned data would be analyzed using EDA and data mining techniques (such as regression, word tokenization and clustering) in order to meet the objectives listed in 1.1.

A regression approach such as Random Forest / Multiple Linear Regression would help identify variable importance in determining track popularity. Tokenization and NLP techniques would help identify whether a set of words (in the titles) tends to occur together in more popular tracks. Furthermore, clustering techniques can be employed to group tracks with similar characteristics.
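As a rough illustration of the regression idea, the sketch below fits a multiple linear regression on a small made-up data frame and ranks the predictors by the absolute value of their standardized coefficients. The toy data, the choice of three audio features and the ranking rule are all assumptions for illustration; they are not the model or the results of this analysis.

```r
# Toy data (made up for illustration): popularity driven mostly by danceability
set.seed(42)
n <- 200
toy <- data.frame(
  danceability = runif(n),
  energy       = runif(n),
  valence      = runif(n)
)
toy$track_popularity <- 60 * toy$danceability + 10 * toy$energy + rnorm(n, sd = 5)

# Standardize all columns so coefficient magnitudes are comparable
toy_scaled <- as.data.frame(scale(toy))

# Multiple linear regression of popularity on the audio features
fit <- lm(track_popularity ~ danceability + energy + valence, data = toy_scaled)

# Rank predictors by absolute standardized coefficient (intercept dropped)
importance <- sort(abs(coef(fit)[-1]), decreasing = TRUE)
print(importance)
```

A Random Forest could be substituted for `lm()` here; the linear model is used only because it needs nothing beyond base R.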

1.3 Key Consumers of the Analysis

The analyses are primarily designed to aid firms and music distributors that operate within the digital streaming services domain. Identifying user trends would help digital music distributors to better streamline their music offerings. The analysis would also help artists to understand their target consumers better (as the analysis is split by music genre).


2. Packages Required

Following are the packages required with their uses:

tidytext = To convert text to and from tidy formats

DT = HTML display of the data

tidyverse = Allows data manipulation

stringr = Allows string operations

magrittr = Pipe operator in R programming

ggplot2 = For graphical representation in R

dplyr = For data manipulation in R

gridExtra = Allows grid formatting of ggplot figures

pracma = Allows advanced numerical analyses

treemap = Allows treemap visualizations

tm = For text mining

GGally = Allows in-depth EDA, works in tandem with ggplot2

randomForest = Creates random forest models

wordcloud = For generating word clouds

plotly = For creating interactive web-based graphs

###########################################
# Installing / Loading Necessary Packages #
###########################################

list_packages <- c("tidytext", "DT", "tidyverse", "stringr", "magrittr", "gridExtra",
                   "pracma", "treemap", "tm", "GGally", "randomForest", "wordcloud",
                   "plotly", "ggplot2", "dplyr", "data.table", "rmarkdown", "tinytex",
                   "knitr")

new_packages <- list_packages[!(list_packages %in% installed.packages()[,"Package"])]

if( length(new_packages) ) install.packages(new_packages)

lapply(list_packages, require, character.only = TRUE)

3. Data Preparation

3.1 Original Data Source

Original Data Source can be found here.

3.2 Explanation of Source Data

The data comes from Spotify and is sourced via the spotifyr package, authored by Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff. The main purpose of the package is to make it easier to obtain general metadata for songs from Spotify’s API. Because the package runs against the API, updated data can be collected through it. The source data has 32833 observations and 23 variables. The dataset contains 15 missing values, which have not been imputed in the original dataset. The variable ‘track_id’ is a unique song ID, yet the column contains 4477 duplicate values, because one song can be associated with multiple genres in the Spotify dataset.

3.3 Data Importing and Cleaning

Step 1 : Converting blank cells to NA while importing the dataset

#################
# Download Data #
#################

spotify <- read.csv(file.path(getwd(), "Master.csv"), na.strings = c("", " ", "NA"))

Step 2 : Identifying Missing Values in the dataset

Missing values were identified for each variable. 15 missing values were identified in the dataset. The observations containing them were removed, as they form a very small proportion of the dataset (at most 15 of 32833 rows, i.e. under 0.05%).

###################
# Preprocess Data #
###################

# Identifying columns with missing values

col_names <- colnames(spotify)   # avoid shadowing base::list
for (i in seq_along(col_names)) {
  dat <- spotify[col_names[i]]
  print(paste("Number of missing values in column", col_names[i], " is : ", sum(is.na(dat))))
}
## [1] "Number of missing values in column track_id  is :  0"
## [1] "Number of missing values in column track_name  is :  5"
## [1] "Number of missing values in column track_artist  is :  5"
## [1] "Number of missing values in column track_popularity  is :  0"
## [1] "Number of missing values in column track_album_id  is :  0"
## [1] "Number of missing values in column track_album_name  is :  5"
## [1] "Number of missing values in column track_album_release_date  is :  0"
## [1] "Number of missing values in column playlist_name  is :  0"
## [1] "Number of missing values in column playlist_id  is :  0"
## [1] "Number of missing values in column playlist_genre  is :  0"
## [1] "Number of missing values in column playlist_subgenre  is :  0"
## [1] "Number of missing values in column danceability  is :  0"
## [1] "Number of missing values in column energy  is :  0"
## [1] "Number of missing values in column key  is :  0"
## [1] "Number of missing values in column loudness  is :  0"
## [1] "Number of missing values in column mode  is :  0"
## [1] "Number of missing values in column speechiness  is :  0"
## [1] "Number of missing values in column acousticness  is :  0"
## [1] "Number of missing values in column instrumentalness  is :  0"
## [1] "Number of missing values in column liveness  is :  0"
## [1] "Number of missing values in column valence  is :  0"
## [1] "Number of missing values in column tempo  is :  0"
## [1] "Number of missing values in column duration_ms  is :  0"
spotify <-  na.omit(spotify)
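
The per-column count above can also be obtained without a loop. The sketch below shows this on a small made-up data frame (the values are invented for illustration), together with the number of rows that `na.omit()` would drop.

```r
# Toy data frame (made up) with blanks already read in as NA
toy <- data.frame(
  track_id   = c("a1", "b2", "c3", "d4"),
  track_name = c("song one", NA, "song three", NA),
  tempo      = c(120.0, 98.5, NA, 140.2),
  stringsAsFactors = FALSE
)

# Count missing values per column in one vectorised call
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

# Rows that na.omit() would drop: those with at least one NA
rows_with_na <- sum(!complete.cases(toy))
print(rows_with_na)
```

Counting incomplete rows separately is useful because several missing values can sit in the same row, so the number of dropped rows can be smaller than the number of missing cells.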

Step 3 : Understanding preliminary data structure and performing necessary type conversions of variables

Variables that could support text mining were converted into character variables. Leading and trailing spaces were removed from these variables and the characters were converted to lower case.

#  identifying structure of the dataset

str(spotify)

#  necessary type conversions

spotify$track_id <- tolower(trimws(as.character(spotify$track_id)))

spotify$track_name <- tolower(trimws(as.character(spotify$track_name)))

spotify$track_artist <- tolower(trimws(as.character(spotify$track_artist)))

spotify$track_album_id <- tolower(trimws(as.character(spotify$track_album_id)))

spotify$track_album_name <- tolower(trimws(as.character(spotify$track_album_name)))

spotify$playlist_name <- tolower(trimws(as.character(spotify$playlist_name)))

spotify$playlist_id <- tolower(trimws(as.character(spotify$playlist_id)))

spotify$playlist_genre <- tolower(trimws(as.character(spotify$playlist_genre)))

spotify$playlist_subgenre <- tolower(trimws(as.character(spotify$playlist_subgenre)))
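
The same conversion can be written once and applied to a vector of column names. The sketch below does this on a made-up data frame; the column subset in `text_cols` is chosen for illustration only.

```r
# Toy data frame (made up) with untidy text columns stored as factors
toy <- data.frame(
  track_name     = c("  Memories ", "All The Time"),
  playlist_genre = c("Pop ", " ROCK"),
  tempo          = c(99.972, 124.008),
  stringsAsFactors = TRUE
)

# Columns to normalise (a subset chosen for illustration)
text_cols <- c("track_name", "playlist_genre")

# Apply the same conversion to every listed column
toy[text_cols] <- lapply(toy[text_cols], function(col) {
  tolower(trimws(as.character(col)))
})

str(toy)
```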

Step 4 : Identifying duplicate observations if any

A user-defined function was created to detect and remove duplicate observations in the dataset. No duplicate observations were found. However, the variable ‘track_id’, although a unique song identifier, has 4477 duplicate values because each song can be associated with multiple genres on Spotify. Hence this variable was left untouched so as to retain the association between tracks and genres.

#  identifying duplicate observations in the dataset

# user-defined function that drops fully duplicated rows, if any
func_duplicate <- function(x) {
  if (sum(duplicated(x)) > 0) {
    x <- x %>% distinct()
  } else {
    print("No duplicate observations found in the dataset")
  }
  x  # return the (possibly de-duplicated) data frame
}

spotify <- func_duplicate(spotify)
## [1] "No duplicate observations found in the dataset"
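
To make the distinction between duplicated rows and duplicated ‘track_id’ values concrete, the sketch below uses a made-up data frame in which one track appears under two genres. The data is invented for illustration.

```r
# Toy data (made up): track "a1" appears under two genres
toy <- data.frame(
  track_id       = c("a1", "a1", "b2", "c3"),
  playlist_genre = c("pop", "rock", "pop", "rap"),
  stringsAsFactors = FALSE
)

# Fully duplicated rows: none, because the genre differs
full_row_dups <- sum(duplicated(toy))

# Repeated track_id values: one, because "a1" occurs twice
track_id_dups <- sum(duplicated(toy$track_id))

print(c(full_rows = full_row_dups, track_id = track_id_dups))
```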

3.4 Cleaned Dataset

Please find below a sample from the cleaned dataset.

#  Outputting Cleaned Data
knitr::kable( head(spotify,3), align = "lccrr", caption = "A sample data") 
A sample data

| track_id | track_name | track_artist | track_popularity | track_album_id | track_album_name | track_album_release_date | playlist_name | playlist_id | playlist_genre | playlist_subgenre | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 6f807x0ima9a1j3vpbc7vn | i don’t care (with justin bieber) - loud luxury remix | ed sheeran | 66 | 2ocs0dgtsro98gh5zsl2cx | i don’t care (with justin bieber) [loud luxury remix] | 6/14/2019 | pop remix | 37i9dqzf1dxczdd7cfekhw | pop | dance pop | 0.748 | 0.916 | 6 | -2.634 | 1 | 0.0583 | 0.1020 | 0.00e+00 | 0.0653 | 0.518 | 122.036 | 194754 |
| 0r7cvbztwzgbtcydfa2p31 | memories - dillon francis remix | maroon 5 | 67 | 63rpso264urjw1x5e6cwv6 | memories (dillon francis remix) | 12/13/2019 | pop remix | 37i9dqzf1dxczdd7cfekhw | pop | dance pop | 0.726 | 0.815 | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 4.21e-03 | 0.3570 | 0.693 | 99.972 | 162600 |
| 1z1hg7vb0ahhdiemnde79l | all the time - don diablo remix | zara larsson | 70 | 1hosmj2elcsrr0ve9gthr4 | all the time (don diablo remix) | 7/5/2019 | pop remix | 37i9dqzf1dxczdd7cfekhw | pop | dance pop | 0.675 | 0.931 | 1 | -3.432 | 0 | 0.0742 | 0.0794 | 2.33e-05 | 0.1100 | 0.613 | 124.008 | 176616 |

3.5 Summary of the Variables

Below is a summary of the variables relevant to the analysis. The primary source of the data description can also be found here. “track_popularity” is one of the main variables of interest, and its summary statistics are provided. No variables are removed in the EDA process, hence a description of all of them is provided.

# reading data dictionary file
a <- read.csv(paste(getwd(),'/dat_dict.csv',sep=''))

knitr::kable(a, align = "lccrr")
| variable | class | description |
| --- | --- | --- |
| track_id | character | Song Unique ID |
| track_name | character | Song Name |
| track_artist | character | Song Artist |
| track_popularity | double | Song Popularity (0-100) where higher is better |
| track_album_id | character | Album unique ID |
| track_album_name | character | Song album name |
| track_album_release_date | Factor | Date when album released |
| playlist_name | character | Name of playlist |
| playlist_genre | character | Playlist genre |
| playlist_subgenre | character | Playlist subgenre |
| danceability | num | Danceability describes how suitable a track is for dancing. A value of 0.0 is least danceable and 1.0 is most danceable. |
| energy | num | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. |
| key | int | The estimated overall key of the track. |
| loudness | num | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. |
| mode | int | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. |
| speechiness | num | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
| acousticness | num | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
| instrumentalness | num | Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
| liveness | num | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
| valence | num | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
| tempo | num | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
| duration_ms | num | Duration of song in milliseconds |

# summary statistics of the variable 'track_popularity'
summary(spotify$track_popularity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   24.00   45.00   42.48   62.00  100.00

4. Proposed Exploratory Data Analysis

This section contains an outline for the EDA plan (for the final project).

4.1 EDA plan summary

Exploratory Data Analysis would consist of the following steps:

  1. Understanding trends such as the count of artists and tracks by genre - This is to understand whether the songs available on Spotify are skewed in any manner

  2. Feature engineering - Creating new variables to classify the era of album release. This would help identify trends by album era. An analysis would also be done to understand whether the factors that affect the popularity of tracks vary by album release era

  3. Understanding the distribution of track_popularity across various genres using kernel density plots:

    • Is the popularity distribution of a genre more tightly concentrated around its mean than that of other genres? This would suggest that tracks of this genre have a higher probability (for the same interval around the mean) of reaching a popularity close to the genre average amongst Spotify users
  4. Understanding the Spotify user base through correlation factor analysis:

    • Are danceable tracks more popular on Spotify?
    • Are tracks conveying higher positivity (high valence) more popular?
    • Are audiobooks / poetry / podcasts more popular than songs on Spotify (analysis using the speechiness variable)?
  5. Are there any common trends among highly popular / less popular artists on Spotify?

    • The common music keys used by highly popular album artists (analyzed using the key variable)
    • Bag of words analysis of music titles / podcast titles within highly popular tracks (is this different from the bag of words obtained for less popular tracks?)
  6. Have the characteristics of a music genre changed over time?

    • Understanding changes in key variables (such as danceability, key, loudness) of a genre over different eras of album release
  7. Analysing the presence of a ‘Network Effect’ on the popularity of artists:

    • Do highly popular artists have a significantly higher number of tracks than others? If the difference is significant, we can further investigate whether the higher number of tracks increases the visibility of an artist and hence improves their overall popularity
  8. Chi-square test to understand the significance of various variables in determining track popularity
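
The bag of words comparison in step 5 could be sketched as follows. The titles, the high/low grouping and the tiny stop-word list are all made up for illustration, and base R is used here instead of the text mining packages listed in section 2, purely to keep the example self-contained.

```r
# Toy titles (made up), split into a "high" and a "low" popularity group
titles <- data.frame(
  track_name = c("love me again", "dance all night", "love on the dance floor",
                 "rainy monday", "lost in the rain"),
  group      = c("high", "high", "high", "low", "low"),
  stringsAsFactors = FALSE
)

# A tiny stop-word list, assumed for illustration only
stop_words_toy <- c("me", "all", "on", "the", "in")

# Tokenize titles into lower-case words and count them, dropping stop words
count_words <- function(x) {
  words <- unlist(strsplit(tolower(x), "[^a-z']+"))
  words <- words[nzchar(words) & !(words %in% stop_words_toy)]
  sort(table(words), decreasing = TRUE)
}

high_counts <- count_words(titles$track_name[titles$group == "high"])
low_counts  <- count_words(titles$track_name[titles$group == "low"])

print(head(high_counts, 3))
print(head(low_counts, 3))
```

Comparing the two frequency tables side by side is the simplest way to see whether the vocabulary of popular titles differs from that of less popular ones.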

4.2 Types of Plots and Tables

For the EDA steps in section 4.1 the following plot types and tables would be used:

  • Kernel density plots (to visualize the distribution of a numeric variable grouped by a categorical variable)

  • Lollipop charts (to visualize category-wise relative ranking in an analysis)

  • Bar charts (to visualize category-wise relative ranking in an analysis)

  • Connected scatter plots (to visualize the change in popularity of a genre by release date)

  • Tree map (to visualize the major categories based on a numeric variable)

  • Data tables to represent bag of words analyses

  • Correlation matrix tables (to visualize correlation factors in an n*n matrix format)

  • Boxplots (to visualize the distribution of a numeric variable)
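
As an illustration of the first plot type, the sketch below estimates one kernel density per genre on made-up popularity scores and overlays the curves with base graphics. The data and the two genres are invented, and base R plotting stands in for ggplot2 only to keep the example dependency-free.

```r
# Toy popularity scores (made up) for two genres with different spread
set.seed(1)
toy <- data.frame(
  playlist_genre   = rep(c("pop", "rock"), each = 100),
  track_popularity = c(rnorm(100, mean = 50, sd = 5),
                       rnorm(100, mean = 50, sd = 20))
)

# One kernel density estimate per genre, on a shared 0-100 range
dens <- lapply(split(toy$track_popularity, toy$playlist_genre),
               density, from = 0, to = 100)

# Overlay the curves with base graphics
plot(dens$pop, main = "Track popularity by genre (toy data)",
     xlab = "track_popularity")
lines(dens$rock, lty = 2)
legend("topright", legend = names(dens), lty = 1:2)

# Spread around the genre mean, as a numeric companion to the plot
spread <- tapply(toy$track_popularity, toy$playlist_genre, sd)
print(spread)
```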

4.3 Techniques to be Learnt

I would need to improve my skills in the Natural Language Processing area of analytics to deliver more impactful and insightful analyses from the dataset.

4.4 Data Mining Techniques to be Incorporated

  • Regression techniques to identify key factors that determine track popularity: The analysis would include regression techniques such as Random Forests, Multiple Linear Regression and Boosting (selected based on best prediction performance)

  • Text mining techniques: Additionally, Natural Language Processing techniques such as word tokenization, word clouds and TF-IDF would be employed to understand the common song title words amongst highly popular tracks

  • Clustering techniques: Furthermore, unsupervised learning techniques such as clustering would be employed to intelligently group tracks having similar characteristics.
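
The clustering step could look roughly like the sketch below, which runs k-means on made-up audio features. The data, the three features and the choice of k = 2 are assumptions for illustration; the analysis does not fix a particular clustering algorithm or number of clusters.

```r
# Toy audio features (made up): two well-separated groups of tracks
set.seed(7)
toy <- data.frame(
  danceability = c(rnorm(20, 0.8, 0.03), rnorm(20, 0.3, 0.03)),
  energy       = c(rnorm(20, 0.9, 0.03), rnorm(20, 0.2, 0.03)),
  tempo        = c(rnorm(20, 125, 3),    rnorm(20, 80, 3))
)

# Scale features so tempo (in BPM) does not dominate the distance
features <- scale(toy)

# k-means with k = 2; k is assumed here, not derived from the real data
km <- kmeans(features, centers = 2, nstart = 10)

# Cluster sizes and per-cluster feature means on the original scale
print(table(km$cluster))
print(aggregate(toy, by = list(cluster = km$cluster), FUN = mean))
```

Scaling matters here because tempo is measured in BPM while the other features lie between 0 and 1; without it, distances would be driven almost entirely by tempo.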


References

https://www.statista.com/statistics/813876/spotify-monthly-active-users-time-spent-listening/

https://qz.com/1736762/spotify-grows-monthly-active-users-and-turns-profit-shares-jump-15-percent/

https://www.tunefab.com/spotify/spotify-bluetooth.html