1. Introduction

Spotify is a digital podcast and music streaming service with over 248 million monthly active users across the globe. The company’s paid service, ‘Spotify Premium’, currently has a user base growing at 31% year on year. While its features keep the average user on the service for 25 hours per month, the data behind the scenes is equally interesting to explore and learn from. This ‘king of music streaming’ is widely recognized for its personalized music recommendations, and the following analyses look into the key determinants that influence track popularity on Spotify. The analyses are primarily designed to aid firms and music distributors that operate within the digital streaming services domain.

1.1 Problem Statement

There are two main objectives for this analysis.

  1. Identify the general trends in music affinity of Spotify users

  2. Determine the key influencers of track popularity on Spotify and predict the popularity class (‘high’, ‘medium’, ‘low’)

1.2 Implementation and Techniques

The dataset contains information on the artist, genre, characteristics, and popularity of various tracks on Spotify. The cleaned data is analyzed using EDA and data mining techniques (such as regression, word tokenization and clustering) to meet the objectives listed in 1.1.

A regression approach such as Random Forest or Multiple Linear Regression helps identify variable importance in determining track popularity. Tokenization and NLP techniques help identify whether a set of title words occurs together more often in popular tracks.

1.3 Key Consumers of the Analysis

The analyses are primarily designed to aid firms and music distributors that operate within the digital streaming services domain. Identifying user trends would help digital music distributors to better streamline their music offerings. The analysis would also help artists to understand their target consumers better (as the analysis is split by music genre).


2. Packages Required

Following are the packages required with their uses:

tidytext = To convert text to and from tidy formats

DT = HTML display of the data

tidyverse = Allows data manipulation

stringr = Allows string operations

magrittr = Pipe operator in R programming

ggplot2 = For graphical representation in R

dplyr = For data manipulation in R

gridExtra = Allows grid formatting of ggplot figures

pracma = Allows advanced numerical analyses

treemap = Allows treemap visualizations

tm = For text mining

GGally = Allows in-depth EDA , works in synchrony with ggplot

randomForest = Creates random forest models

wordcloud = For word cloud generator

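plotly = For creating interactive web-based graphs

corrplot = For plotting the correlation matrix

lubridate = For working with dates

knitr = For rendering tables with kable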


3. Data Preparation

3.1 Original Data Source

Original Data Source can be found here.

3.2 Explanation of Source Data

The data comes from Spotify and is sourced via the spotifyr package. The package was authored by Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff. Its main purpose is to make it easier to obtain general metadata for songs from Spotify’s API. Because the package runs against the API, updated data can be collected through it. The source data has 32833 observations and 23 variables. The dataset contains 15 missing values, which have not been imputed in the original dataset. The variable ‘track_id’ is a unique song id, yet it has 4477 duplicate values because one song can be associated with multiple genres in the Spotify dataset.

3.3 Data Importing and Cleaning

Step 1 : Converting blank cells to NA while importing the dataset

#################
# Download Data #
#################

# Treat empty cells, single spaces and the literal string "NA" as missing
spotify <- read.csv('Master.csv', na.strings = c("", " ", "NA"))

Step 2 : Identifying Missing Values in the dataset

Missing values were counted for each variable, and 15 missing values were found in the dataset. The affected observations were removed because they form a very small proportion of the dataset (at most 15 of 32833 rows, under 0.05%).

###################
# Preprocess Data #
###################

# Identifying columns with missing values

col_names <- colnames(spotify)
for (col in col_names) {
  n_missing <- sum(is.na(spotify[[col]]))
  print(paste("Number of missing values in column", col, " is : ", n_missing))
}
## [1] "Number of missing values in column track_id  is :  0"
## [1] "Number of missing values in column track_name  is :  5"
## [1] "Number of missing values in column track_artist  is :  5"
## [1] "Number of missing values in column track_popularity  is :  0"
## [1] "Number of missing values in column track_album_id  is :  0"
## [1] "Number of missing values in column track_album_name  is :  5"
## [1] "Number of missing values in column track_album_release_date  is :  0"
## [1] "Number of missing values in column playlist_name  is :  0"
## [1] "Number of missing values in column playlist_id  is :  0"
## [1] "Number of missing values in column playlist_genre  is :  0"
## [1] "Number of missing values in column playlist_subgenre  is :  0"
## [1] "Number of missing values in column danceability  is :  0"
## [1] "Number of missing values in column energy  is :  0"
## [1] "Number of missing values in column key  is :  0"
## [1] "Number of missing values in column loudness  is :  0"
## [1] "Number of missing values in column mode  is :  0"
## [1] "Number of missing values in column speechiness  is :  0"
## [1] "Number of missing values in column acousticness  is :  0"
## [1] "Number of missing values in column instrumentalness  is :  0"
## [1] "Number of missing values in column liveness  is :  0"
## [1] "Number of missing values in column valence  is :  0"
## [1] "Number of missing values in column tempo  is :  0"
## [1] "Number of missing values in column duration_ms  is :  0"
spotify <-  na.omit(spotify)
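
The per-column loop above can also be written as a single vectorized call, and the share of rows that `na.omit()` drops can be checked directly. The following is a minimal sketch on a small made-up data frame; `toy` and its values are illustrative assumptions and are not taken from Master.csv.

```r
# Toy data frame standing in for the imported data (values are made up)
toy <- data.frame(
  track_id   = c("a1", "b2", "c3", "d4"),
  track_name = c("song one", NA, "song three", NA),
  energy     = c(0.9, 0.5, NA, 0.7),
  stringsAsFactors = FALSE
)

# Count missing values per column in one vectorized call
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

# Rows with at least one NA are the rows that na.omit() would drop
rows_dropped <- sum(!complete.cases(toy))
share_dropped <- rows_dropped / nrow(toy)
print(share_dropped)
```

Counting incomplete rows rather than missing cells matters when several missing values fall in the same row, since `na.omit()` removes each such row only once.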

Step 3 : Understanding preliminary data structure and performing necessary type conversions of variables

Variables that could be useful for text mining were converted into character variables. Leading and trailing spaces were removed from these variables and the characters were converted to lower case.

#  identifying structure of the dataset

str(spotify)

#  necessary type conversions

spotify$track_id <- tolower(trimws(as.character(spotify$track_id)))
spotify$track_id <-  as.factor(spotify$track_id)

spotify$track_name <- tolower(trimws(as.character(spotify$track_name)))

spotify$track_artist <- tolower(trimws(as.character(spotify$track_artist)))

spotify$track_album_id <- tolower(trimws(as.character(spotify$track_album_id)))

spotify$track_album_name <- tolower(trimws(as.character(spotify$track_album_name)))

spotify$playlist_name <- tolower(trimws(as.character(spotify$playlist_name)))

spotify$playlist_id <- tolower(trimws(as.character(spotify$playlist_id)))

spotify$playlist_genre <- tolower(trimws(as.character(spotify$playlist_genre)))

spotify$playlist_subgenre <- tolower(trimws(as.character(spotify$playlist_subgenre)))

# Converting "Album Release Date" into a numeric year

library(lubridate)
# Dates are expected in month/day/year form, e.g. "6/14/2019"
release_date <- as.Date(spotify$track_album_release_date, format = "%m/%d/%Y")
spotify$track_album_release_date_yr <- year(release_date)
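
A date string that does not match the month/day/year format is silently turned into `NA` by `as.Date()`, so it can be worth counting unparsed entries after the conversion. The sketch below uses made-up date strings; whether Master.csv actually contains entries in another form (such as a bare year) is an assumption, not something the source states.

```r
# Illustrative release-date strings; the last one is deliberately not in m/d/Y form
release_dates <- c("6/14/2019", "12/13/2019", "7/5/2019", "1978")

# Parse with the same format string used in the conversion above
parsed <- as.Date(release_dates, format = "%m/%d/%Y")
release_year <- as.numeric(format(parsed, "%Y"))

# Entries that did not match the expected format come back as NA
n_unparsed <- sum(is.na(parsed))
print(release_year)
print(n_unparsed)
```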

Step 4 : Identifying duplicate observations if any

A user-defined function is created to remove duplicate observations from the dataset. No duplicate observations were found. The variable ‘track_id’, although it is a unique song identifier, has 4477 duplicate values because each song can be associated with multiple genres on Spotify. Hence this variable was left unchanged so as to retain the association between tracks and genres.

#  identifying duplicate observations in the dataset

# user defined function that drops duplicated rows (if any) and returns the data
func_duplicate <- function(x) {
  if (sum(duplicated(x)) > 0) {
    x <- x %>% distinct()
  } else {
    print("No duplicate observations found in the dataset")
  }
  x
}

spotify <- func_duplicate(spotify)
## [1] "No duplicate observations found in the dataset"
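
The distinction between duplicated rows and duplicated ‘track_id’ values can be illustrated on a small example. This is a minimal sketch with made-up values; the data frame `toy` is hypothetical and only mirrors the situation described above, where one track appears under several genres.

```r
# Made-up data: track "t1" is listed under two different genres
toy <- data.frame(
  track_id       = c("t1", "t1", "t2", "t3"),
  playlist_genre = c("pop", "edm", "rock", "rap"),
  stringsAsFactors = FALSE
)

# No row is an exact copy of another row
n_duplicate_rows <- sum(duplicated(toy))

# But one track_id value is repeated
n_duplicate_ids <- sum(duplicated(toy$track_id))

# Number of distinct genres attached to each track
genres_per_track <- tapply(toy$playlist_genre, toy$track_id,
                           function(g) length(unique(g)))
print(genres_per_track)
```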

3.4 Feature Engineering

  • “Album Release Era” groups tracks by the decade of the album release date

  • “Popularity Class” groups tracks into high, medium and low popularity classes based on the quartiles of the popularity score. A track with a popularity score >= 3rd quartile is classified as ‘high’, a track with a score < 1st quartile is classified as ‘low’, and all others are classified as ‘medium’

  • Based on the ‘speechiness’ variable, an indicator is created to flag whether a track is a podcast or music

# Creating 'Album Release Era'

spotify$release_era<-ifelse( spotify$track_album_release_date_yr < 1970 , "1960's",
                            ifelse( spotify$track_album_release_date_yr < 1980, "1970's",
                            ifelse( spotify$track_album_release_date_yr < 1990, "1980's",
                            ifelse( spotify$track_album_release_date_yr < 2000,"1990's",
                            ifelse( spotify$track_album_release_date_yr < 2010 , "2000's","2010's")))))

# Classifying tracks based on popularity score
# a. Track popularity < 1st quartile flagged as low popularity
# b. Track popularity >= 1st and < 3rd quartile flagged as medium popularity
# c. Track popularity >= 3rd quartile flagged as high popularity

spotify$popularity_class <- ifelse( spotify$track_popularity < 24 , "low",
                                 ifelse( spotify$track_popularity >= 24 &spotify$track_popularity< 62, "medium","high"))

spotify$popularity_class <-  as.factor(spotify$popularity_class)

# Podcast / Music classification based on 'speechiness' variable

spotify$podcast_music_cls <-  ifelse( spotify$speechiness >=0.66, "podcast", "music" )
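
The popularity thresholds above are written as the constants 24 and 62. As an alternative, the cut points can be derived from the data with `quantile()` and applied with `cut()`. This is a sketch on a made-up popularity vector, not the method used in this analysis; the vector is chosen so that its quartiles happen to equal 24 and 62.

```r
# Made-up popularity scores (not from the dataset)
popularity <- c(0, 10, 24, 30, 45, 50, 62, 70, 100)

# 1st and 3rd quartiles of the score
q <- unname(quantile(popularity, probs = c(0.25, 0.75)))

# right = FALSE makes the intervals [-Inf, Q1), [Q1, Q3), [Q3, Inf),
# which matches the "< Q1", ">= Q1 and < Q3", ">= Q3" rules
popularity_class <- cut(
  popularity,
  breaks = c(-Inf, q[1], q[2], Inf),
  labels = c("low", "medium", "high"),
  right  = FALSE
)
print(table(popularity_class))
```

Deriving the breaks from the data keeps the classes aligned with the quartiles if the dataset is refreshed.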

3.5 Cleaned Dataset

Please find below a sample from the cleaned dataset.

#  Outputting Cleaned Data

knitr::kable( head(spotify,3), align = "lccrr", caption = "A sample data") 
A sample data

track_id | track_name | track_artist | track_popularity | track_album_id | track_album_name | track_album_release_date | playlist_name | playlist_id | playlist_genre | playlist_subgenre | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | track_album_release_date_yr | release_era | popularity_class | podcast_music_cls
6f807x0ima9a1j3vpbc7vn | i don’t care (with justin bieber) - loud luxury remix | ed sheeran | 66 | 2ocs0dgtsro98gh5zsl2cx | i don’t care (with justin bieber) [loud luxury remix] | 6/14/2019 | pop remix | 37i9dqzf1dxczdd7cfekhw | pop | dance pop | 0.748 | 0.916 | 6 | -2.634 | 1 | 0.0583 | 0.1020 | 0.00e+00 | 0.0653 | 0.518 | 122.036 | 194754 | 2019 | 2010’s | high | music
0r7cvbztwzgbtcydfa2p31 | memories - dillon francis remix | maroon 5 | 67 | 63rpso264urjw1x5e6cwv6 | memories (dillon francis remix) | 12/13/2019 | pop remix | 37i9dqzf1dxczdd7cfekhw | pop | dance pop | 0.726 | 0.815 | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 4.21e-03 | 0.3570 | 0.693 | 99.972 | 162600 | 2019 | 2010’s | high | music
1z1hg7vb0ahhdiemnde79l | all the time - don diablo remix | zara larsson | 70 | 1hosmj2elcsrr0ve9gthr4 | all the time (don diablo remix) | 7/5/2019 | pop remix | 37i9dqzf1dxczdd7cfekhw | pop | dance pop | 0.675 | 0.931 | 1 | -3.432 | 0 | 0.0742 | 0.0794 | 2.33e-05 | 0.1100 | 0.613 | 124.008 | 176616 | 2019 | 2010’s | high | music

3.6 Summary of the Variables

Below is a summary of the variables used in the analysis. The primary source of the data description can also be found here. “track_popularity” is one of the main variables of interest, and its summary statistics are provided. No variables are removed in the EDA process, hence a description of all variables is provided.

# reading data dictionary file
#a <- read.csv('dat_dict.csv',sep=',')

#knitr::kable(a, align = "lccrr")

# summary statistics of the variable 'track_popularity'
summary(spotify$track_popularity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   24.00   45.00   42.48   62.00  100.00

4. Exploratory Data Analysis

4.2 Correlation Factor Analysis

Correlation Matrix

Key Insights

  • Track popularity slightly decreases with higher song duration and with high instrumentalness of the track

  • Valence ( positivity) of a track is positively correlated to the danceability of the track

# Correlation matrix
library(corrplot)
cor_vars <- c("danceability","valence","speechiness","track_popularity","duration_ms","instrumentalness")
corrplot(cor(spotify[, cor_vars]), order = "hclust")
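
Alongside the plot, the correlations with track popularity can be read off as numbers by pulling one row out of the correlation matrix. The sketch below uses a made-up data frame whose values were chosen only to illustrate the mechanics; they are not results from the Spotify data.

```r
# Made-up values for three of the variables (not from the dataset)
toy <- data.frame(
  track_popularity = c(80, 65, 50, 40, 20),
  duration_ms      = c(180000, 200000, 220000, 250000, 300000),
  danceability     = c(0.80, 0.70, 0.75, 0.50, 0.40)
)

# Pearson correlation matrix of all numeric columns
cor_matrix <- cor(toy)

# Keep the popularity row, drop its self-correlation, and sort ascending
other_vars <- colnames(cor_matrix) != "track_popularity"
popularity_cor <- sort(cor_matrix["track_popularity", other_vars])
print(round(popularity_cor, 2))
```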

Are podcasts more popular than music tracks on Spotify?

On average, podcasts are less popular than music tracks on Spotify, and the number of podcasts is far smaller than the number of music tracks (26 versus 28326 unique tracks). Hence there may be an opportunity to expand the user base and choices for podcasts on Spotify.

dat <-  spotify %>% select (podcast_music_cls , track_popularity , track_id) %>% 
        group_by (podcast_music_cls) %>% summarise(avg_rating = mean(track_popularity),
                                                 track_count = length(unique(track_id)))
dat
## # A tibble: 2 x 3
##   podcast_music_cls avg_rating track_count
##   <chr>                  <dbl>       <int>
## 1 music                   42.5       28326
## 2 podcast                 36.8          26

4.4 Analysis Using Text Mining Techniques

Commonly occurring track title words by popularity

Key Insights

The most frequent title words differ between highly popular tracks and less popular ones. The title words ‘dance’, ‘hits’, ‘hip’ and ‘hop’ are more frequent in highly popular tracks than in less popular ones.

Word Cloud (of Titles) for Highly Popular Tracks

# Generating wordcloud of titles for highly popular tracks
# Keep only the track titles so that other columns do not leak into the text
dat_h <- subset(spotify, popularity_class == 'high')$track_name

dat_l <- subset(spotify, popularity_class == 'low')$track_name

# Replace every non-alphanumeric character with a space
dat_h <- str_replace_all(dat_h, "[^a-zA-Z0-9]", " ")

dat_l <- str_replace_all(dat_l, "[^a-zA-Z0-9]", " ")

docs <- Corpus(VectorSource(dat_h))

toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")

# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))

# Remove numbers
docs <- tm_map(docs, removeNumbers)

# Remove english common stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))

# Remove words like 'remix','feat','mix','edit'
title_words <-  c("remix","mix","feat","edit","original","edm","rock","latin","pop","rap","r&b","music")
docs <- tm_map(docs, removeWords, title_words)

# specify your stopwords as a character vector
docs <- tm_map(docs, removeWords, c("blabla1", "blabla2")) 

# Remove punctuations
docs <- tm_map(docs, removePunctuation)

# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)

# Text stemming
# docs <- tm_map(docs, stemDocument)
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
set.seed(1234)
w1 = wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))

Word Cloud (of Titles) for Less Popular Tracks

# word cloud for less popular tracks
docs <- Corpus(VectorSource(dat_l))

toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))

docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")

# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))

# Remove numbers
docs <- tm_map(docs, removeNumbers)

# Remove english common stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))

# Remove words like 'remix','feat','mix','edit'
title_words <-  c("remix","mix","feat","edit","original","edm","rock","latin","pop","rap","r&b","music")
docs <- tm_map(docs, removeWords, title_words)

# specify your stopwords as a character vector
docs <- tm_map(docs, removeWords, c("blabla1", "blabla2")) 

# Remove punctuations
docs <- tm_map(docs, removePunctuation)

# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)

dtm <- TermDocumentMatrix(docs)

m <- as.matrix(dtm)

v <- sort(rowSums(m),decreasing=TRUE)

d <- data.frame(word = names(v),freq=v)

set.seed(1234)

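# wordcloud() draws the plot as a side effect and returns nothing useful,
# so its result does not need to be stored or printed
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))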
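
The word clouds are drawn from word frequency tables, and those tables can also be compared directly between popularity classes. The following is a minimal base R sketch on made-up titles; `count_words` is a hypothetical helper introduced here for illustration, and it skips the stop-word removal done by the tm pipeline above.

```r
# Made-up titles for two popularity classes (not taken from the dataset)
titles_high <- c("dance with me", "summer dance", "midnight city")
titles_low  <- c("quiet morning", "city lights", "morning rain")

# Hypothetical helper: split titles into lower-case words and count them
count_words <- function(titles) {
  words <- unlist(strsplit(tolower(titles), "[^a-z0-9]+"))
  words <- words[nchar(words) > 0]
  sort(table(words), decreasing = TRUE)
}

freq_high <- count_words(titles_high)
freq_low  <- count_words(titles_low)

# Words that occur in the high-popularity titles but not in the low ones
only_high <- setdiff(names(freq_high), names(freq_low))
print(freq_high)
print(only_high)
```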

4.5 Predicting Track Popularity Class

The following section predicts the popularity class (‘high’, ‘medium’, ‘low’) of a track. While the predictive model gives a satisfactory performance, the key element is identifying how important each variable is to the prediction.

We observe that song duration (“duration_ms”), “instrumentalness” and “loudness” are the three most important attributes for predicting the popularity of a track on Spotify. This is also in line with the correlation matrix analysis results.

Random Forest Classification Model

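# Train - Test Split 

# A fixed seed (value chosen arbitrarily) makes the random split reproducible;
# exact counts in the outputs below vary with the seed
set.seed(1234)
sample_index <- sample(nrow(spotify), floor(nrow(spotify) * 0.70))

# Response and predictors used by the model
model_vars <- c("popularity_class","playlist_genre","danceability","energy","key","loudness","mode","speechiness",
                "acousticness","instrumentalness","liveness","valence","tempo","duration_ms")

# Using 70% data for training
spotify_train <- spotify[sample_index, model_vars]

# Using 30% data for testing
spotify_test <- spotify[-sample_index, model_vars]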

#converting into factors
spotify_train$playlist_genre <-  as.factor(spotify_train$playlist_genre)
spotify_test$playlist_genre <-  as.factor(spotify_test$playlist_genre)
spotify_train$popularity_class <-  as.factor(spotify_train$popularity_class)
spotify_test$popularity_class <-  as.factor(spotify_test$popularity_class)


# Building Random Forest Model

spotify.rf <- randomForest(popularity_class~., data = spotify_train, ntree=500 ,importance=T)
spotify.rf
## 
## Call:
##  randomForest(formula = popularity_class ~ ., data = spotify_train,      ntree = 500, importance = T) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 37.97%
## Confusion matrix:
##        high  low medium class.error
## high   3069  277   2534   0.4780612
## low     490 1264   3980   0.7795605
## medium  783  661   9921   0.1270568
# prediction performance in test
spotify.rf.pred.test<- predict(spotify.rf, newdata=spotify_test, type = "class")
table(spotify_test$popularity_class, spotify.rf.pred.test, dnn = c("True", "Pred"))
##         Pred
## True     high  low medium
##   high   1334   96   1140
##   low     186  463   1704
##   medium  333  255   4338
# Misclassification rate of 37.7%
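
The test misclassification rate can be recomputed from the confusion matrix printed above, together with the error rate for each true class. The sketch below re-enters the printed test counts by hand; only the variable names are new.

```r
# Test confusion matrix copied from the output above (rows = true, columns = predicted)
conf_test <- matrix(
  c(1334,  96, 1140,
     186, 463, 1704,
     333, 255, 4338),
  nrow = 3, byrow = TRUE,
  dimnames = list(True = c("high", "low", "medium"),
                  Pred = c("high", "low", "medium"))
)

# Overall accuracy is the share of observations on the diagonal
accuracy <- sum(diag(conf_test)) / sum(conf_test)
misclassification_rate <- 1 - accuracy

# Error rate within each true class
per_class_error <- 1 - diag(conf_test) / rowSums(conf_test)

print(round(misclassification_rate, 3))
print(round(per_class_error, 3))
```

The per-class view shows that most of the error comes from ‘low’ tracks being predicted as ‘medium’, which the overall rate alone does not reveal.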

Important factors that influence track popularity prediction

#Understanding the most important influencers of popularity
varImpPlot(spotify.rf)

6. Summary of the Analysis

6.1 Summarizing the Problem Statement

The analysis mainly focuses on giving key trends and insights to music artists and music distributors that operate within the digital streaming services domain. The general trends in track popularity were identified for different genres of music and different album release eras. The “title words” of highly popular tracks were compared against those of less popular ones. Additionally, the popularity class of a track was predicted and the key influencers of popularity were determined.

6.2 Summarizing the Implementation

  • Packages such as Plotly were used to represent key trends in an interactive fashion

  • Text mining techniques were used to generate word clouds of track titles (for various popularity classes)

  • A Random Forest classifier was used to predict the popularity class of a track

6.3 Summary of Insights and its Implications (to the consumers of this analysis)

  • While the ‘pop’ genre has a lower number of total tracks on Spotify, it has the highest average popularity score amongst users

  • ‘edm’ and ‘rap’ have a higher number of tracks on Spotify, hence new artists in these genres face stiffer competition in terms of track visibility

  • Track popularity slightly decreases with higher song duration and with high instrumentalness of the track

  • Tracks with a higher danceability index tend to have higher valence (positivity) scores

  • Rock music of the early release eras is more popular amongst users than newly released tracks

  • The popularity of the genre “R&B” is improving amongst the Spotify user base, with tracks released in the 2010’s having higher popularity than tracks released in the 2000’s, 90’s and 80’s. Artists of this genre would hence have higher scope and visibility amongst Spotify users, and music distributors could focus more on this genre

  • The title words ‘dance’, ‘hits’, ‘hip’ and ‘hop’ are more frequent in highly popular tracks than in less popular ones

  • Song duration (“duration_ms”), “instrumentalness” and “loudness” are the three most important attributes for predicting the popularity of a track on Spotify

6.4 Limitations of the analysis

  • The analysis is based on a static dataset. It could be improved by collecting data dynamically through a web API

  • User reviews, if available, could be used to further understand the key aspects that increase the affinity of users to a music or podcast track

  • While the Random Forest classifier gives a satisfactory prediction performance, neural networks could also be explored to improve it