Analyzing Tracks on Spotify

1. Introduction

Spotify is a digital podcast and music streaming service with over 248 million monthly active users across the globe. The company's paid service, known as ‘Spotify Premium’, currently has a user base growing at 31% year on year. While its features keep the average user on the service for 25 hours per month, the data behind the scenes is equally interesting to dive into and learn from. This ‘king of music streaming’ is widely recognized for its personalized music recommendations, and the following analyses look into the key determinants that influence track popularity on Spotify. The analyses are primarily designed to aid firms and music distributors that operate within the digital streaming services domain.

1.1 Problem Statement

There are three main objectives for this analysis.

Identify the general trends in music affinity of Spotify users
Determine key influencers of track popularity on Spotify (by album release era and genre)
Intelligently group music tracks based on their characteristics

1.2 Implementation and Techniques

The dataset contains information on the artist, genre, characteristics, and popularity of various tracks on Spotify. The cleaned data would be analyzed using EDA techniques and data mining techniques (such as regression, word tokenization, and clustering) in order to meet the objectives listed in Section 1.1.

A regression approach such as Random Forest or Multiple Linear Regression would help identify the importance of each variable in determining track popularity. Tokenization and NLP techniques would help identify whether there is a set of words (in the titles) that tend to occur together in more popular tracks. Furthermore, clustering techniques can be employed to group tracks with similar characteristics.
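As a rough illustration of the regression idea, the sketch below fits a multiple linear regression on simulated data and ranks the predictors by the absolute size of their standardized coefficients. The data frame `toy`, its coefficients, and the seed are invented for illustration and are not the Spotify data. Standardized coefficients are only one way to gauge variable importance; the importance scores of a random forest would be another.

```r
# Simulated stand-in for the Spotify data (hypothetical values, not real tracks)
set.seed(42)
n <- 500
toy <- data.frame(
  danceability = runif(n),
  energy       = runif(n),
  valence      = runif(n)
)

# By construction, popularity depends strongly on danceability,
# weakly on energy, and not at all on valence
toy$track_popularity <- 40 + 30 * toy$danceability + 5 * toy$energy +
  rnorm(n, sd = 5)

# Standardize every column so coefficient magnitudes are comparable
toy_scaled <- as.data.frame(scale(toy))

fit <- lm(track_popularity ~ danceability + energy + valence, data = toy_scaled)

# Rank predictors by absolute standardized coefficient (intercept dropped)
importance <- sort(abs(coef(fit)[-1]), decreasing = TRUE)
print(importance)
```

On the real data the same call would take the cleaned `spotify` data frame and the audio-feature columns as predictors.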

1.3 Key Consumers of the Analysis

The analyses are primarily designed to aid firms and music distributors that operate within the digital streaming services domain. Identifying user trends would help digital music distributors to better streamline their music offerings. The analysis would also help artists to understand their target consumers better (as the analysis is split by music genre).

2. Packages Required

Following are the packages required with their uses:

tidytext = To convert text to and from tidy formats

DT = HTML display of the data

tidyverse = Allows data manipulation

stringr = Allows string operations

magrittr = Pipe operator in R

ggplot2 = For graphical representation in R

dplyr = For data manipulation in R

gridExtra = Allows grid formatting of ggplot figures

pracma = Allows advanced numerical analyses

treemap = Allows treemap visualizations

tm = For text mining

GGally = Allows in-depth EDA; works in tandem with ggplot2

randomForest = Creates random forest models

wordcloud = For generating word clouds

plotly = For creating interactive web-based graphs

###########################################
# Installing / Loading Necessary Packages #
###########################################

list_packages <- c("tidytext", "DT", "tidyverse", "stringr", "magrittr", "gridExtra",
                   "pracma", "treemap", "tm", "GGally", "randomForest", "wordcloud",
                   "plotly", "ggplot2", "dplyr", "data.table", "rmarkdown", "tinytex",
                   "knitr")

new_packages <- list_packages[!(list_packages %in% installed.packages()[,"Package"])]

if( length(new_packages) ) install.packages(new_packages)

lapply(list_packages, require, character.only = TRUE)

3. Data Preparation

3.1 Original Data Source

Original Data Source can be found here.

3.2 Explanation of Source Data

The data comes from Spotify and is sourced via the spotifyr package, authored by Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff. The main purpose of the package is to make it easier to obtain general metadata for songs from Spotify’s API. Because the package runs against the API, updated data can be collected through it. The source data has 32833 observations and 23 variables. The dataset contains 15 missing values, which have not been imputed in the original dataset. The variable ‘track_id’ is a unique song ID, yet the column contains 4477 duplicate values because one song can be associated with multiple genres in the Spotify dataset.

3.3 Data Importing and Cleaning

Step 1 : Converting blank cells to NA while importing the dataset

#################
# Download Data #
#################

# Treat empty strings, single spaces and the literal string "NA" as missing
spotify <- read.csv(file.path(getwd(), "Master.csv"), na.strings = c("", " ", "NA"))

Step 2 : Identifying Missing Values in the dataset

Missing values were identified for each variable; 15 were found in the dataset. The affected observations were removed as they formed a very small proportion of the dataset (at most 15 of the 32833 observations, i.e. under 0.05%).

###################
# Preprocess Data #
###################

# Identifying columns with missing values

# col_names avoids shadowing the base function list()
col_names <- colnames(spotify)
for (i in seq_along(col_names)) {
  dat <- spotify[col_names[i]]
  print(paste("Number of missing values in column", col_names[i], " is : ", sum(is.na(dat))))
}

## [1] "Number of missing values in column track_id  is :  0"
## [1] "Number of missing values in column track_name  is :  5"
## [1] "Number of missing values in column track_artist  is :  5"
## [1] "Number of missing values in column track_popularity  is :  0"
## [1] "Number of missing values in column track_album_id  is :  0"
## [1] "Number of missing values in column track_album_name  is :  5"
## [1] "Number of missing values in column track_album_release_date  is :  0"
## [1] "Number of missing values in column playlist_name  is :  0"
## [1] "Number of missing values in column playlist_id  is :  0"
## [1] "Number of missing values in column playlist_genre  is :  0"
## [1] "Number of missing values in column playlist_subgenre  is :  0"
## [1] "Number of missing values in column danceability  is :  0"
## [1] "Number of missing values in column energy  is :  0"
## [1] "Number of missing values in column key  is :  0"
## [1] "Number of missing values in column loudness  is :  0"
## [1] "Number of missing values in column mode  is :  0"
## [1] "Number of missing values in column speechiness  is :  0"
## [1] "Number of missing values in column acousticness  is :  0"
## [1] "Number of missing values in column instrumentalness  is :  0"
## [1] "Number of missing values in column liveness  is :  0"
## [1] "Number of missing values in column valence  is :  0"
## [1] "Number of missing values in column tempo  is :  0"
## [1] "Number of missing values in column duration_ms  is :  0"

spotify <-  na.omit(spotify)
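The per-column loop above can also be expressed with vectorized base R calls. The sketch below is one possible alternative, run on a small made-up data frame (`toy` and its values are hypothetical). It also separates the number of missing cells from the number of rows that `na.omit()` would actually drop, since several missing cells can sit in the same row.

```r
# Toy data frame with a few missing entries (hypothetical values)
toy <- data.frame(
  track_name   = c("song a", NA, "song c", "song d"),
  track_artist = c("artist 1", NA, "artist 3", "artist 4"),
  energy       = c(0.9, 0.5, 0.7, 0.4),
  stringsAsFactors = FALSE
)

# Missing values per column, computed in one vectorized call
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Number of rows containing at least one missing value
rows_with_na <- sum(!complete.cases(toy))

# Share of rows that na.omit() would drop, in percent
pct_dropped <- 100 * rows_with_na / nrow(toy)

# Drop the incomplete rows
toy_clean <- na.omit(toy)
```

Here two cells are missing but only one row is removed, which is why the share of dropped rows can be smaller than the count of missing values suggests.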

Step 3 : Understanding preliminary data structure and performing necessary type conversions of variables

Variables that could be useful for text mining were converted to character variables. Leading and trailing spaces were removed from these variables, and the characters were converted to lower case.

#  identifying structure of the dataset

str(spotify)

#  necessary type conversions

spotify$track_id <- tolower(trimws(as.character(spotify$track_id)))

spotify$track_name <- tolower(trimws(as.character(spotify$track_name)))

spotify$track_artist <- tolower(trimws(as.character(spotify$track_artist)))

spotify$track_album_id <- tolower(trimws(as.character(spotify$track_album_id)))

spotify$track_album_name <- tolower(trimws(as.character(spotify$track_album_name)))

spotify$playlist_name <- tolower(trimws(as.character(spotify$playlist_name)))

spotify$playlist_id <- tolower(trimws(as.character(spotify$playlist_id)))

spotify$playlist_genre <- tolower(trimws(as.character(spotify$playlist_genre)))

spotify$playlist_subgenre <- tolower(trimws(as.character(spotify$playlist_subgenre)))
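Because the same conversion is applied to every text column, it could also be written once and applied to a vector of column names. The following is a minimal sketch of that idea on a made-up data frame; `toy`, `text_cols`, and the values are assumptions for illustration, not part of the original analysis.

```r
# Toy data frame with untidy text columns (hypothetical values)
toy <- data.frame(
  track_name     = c("  Song A ", "SONG b"),
  playlist_genre = factor(c("Pop ", " ROCK")),
  energy         = c(0.9, 0.5)
)

# Columns to treat as cleaned, lower-case character strings
text_cols <- c("track_name", "playlist_genre")

# Apply the same conversion to every listed column;
# numeric columns such as energy are left untouched
toy[text_cols] <- lapply(toy[text_cols], function(col) {
  tolower(trimws(as.character(col)))
})

str(toy)
```

Listing the columns in one vector keeps the set of converted variables easy to audit and extend.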

Step 4 : Identifying duplicate observations if any

A user-defined function was created to identify and remove duplicate observations in the dataset. No duplicate observations were found. The variable ‘track_id’, although a unique song identifier, has 4477 duplicate values because each song can be associated with multiple genres on Spotify. Hence this variable was left unchanged so as to retain the association between tracks and genres.

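#  identifying duplicate observations in the dataset

# user-defined function: drops exact duplicate rows, or reports that none exist
func_duplicate <- function(x) {
  if (sum(duplicated(x)) > 0) {
    x <- dplyr::distinct(x)
  } else {
    print("No duplicate observations found in the dataset")
  }
  x  # always return the (possibly de-duplicated) data
}

# assign the result so that any de-duplication is actually kept
spotify <- func_duplicate(spotify)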

## [1] "No duplicate observations found in the dataset"
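The distinction between duplicate rows and duplicate `track_id` values can be checked directly. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical) in which one track appears under two genres, so no full row repeats even though an ID does.

```r
# Toy data: track "t1" appears under two playlist genres (hypothetical values)
toy <- data.frame(
  track_id       = c("t1", "t1", "t2", "t3"),
  playlist_genre = c("pop", "rock", "pop", "rap"),
  stringsAsFactors = FALSE
)

# No row is an exact copy of another row
n_duplicate_rows <- sum(duplicated(toy))

# But track_id values repeat across genres
n_duplicate_ids <- sum(duplicated(toy$track_id))

# Number of distinct genres each track is associated with
genres_per_track <- tapply(toy$playlist_genre, toy$track_id,
                           function(g) length(unique(g)))
print(genres_per_track)
```

Running the same three checks on the full dataset would show how many of the repeated IDs come from genre differences rather than from other columns such as playlist name.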

3.4 Cleaned Dataset

Please find below a sample from the cleaned dataset.

#  Outputting Cleaned Data
knitr::kable( head(spotify,3), align = "lccrr", caption = "A sample data")

A sample data
track_id	track_name	track_artist	track_popularity	track_album_id	track_album_name	track_album_release_date	playlist_name	playlist_id	playlist_genre	playlist_subgenre	danceability	energy	key	loudness	mode	speechiness	acousticness	instrumentalness	liveness	valence	tempo	duration_ms
6f807x0ima9a1j3vpbc7vn	i don’t care (with justin bieber) - loud luxury remix	ed sheeran	66	2ocs0dgtsro98gh5zsl2cx	i don’t care (with justin bieber) [loud luxury remix]	6/14/2019	pop remix	37i9dqzf1dxczdd7cfekhw	pop	dance pop	0.748	0.916	6	-2.634	1	0.0583	0.1020	0.00e+00	0.0653	0.518	122.036	194754
0r7cvbztwzgbtcydfa2p31	memories - dillon francis remix	maroon 5	67	63rpso264urjw1x5e6cwv6	memories (dillon francis remix)	12/13/2019	pop remix	37i9dqzf1dxczdd7cfekhw	pop	dance pop	0.726	0.815	11	-4.969	1	0.0373	0.0724	4.21e-03	0.3570	0.693	99.972	162600
1z1hg7vb0ahhdiemnde79l	all the time - don diablo remix	zara larsson	70	1hosmj2elcsrr0ve9gthr4	all the time (don diablo remix)	7/5/2019	pop remix	37i9dqzf1dxczdd7cfekhw	pop	dance pop	0.675	0.931	1	-3.432	0	0.0742	0.0794	2.33e-05	0.1100	0.613	124.008	176616

3.5 Summary of the Variables

Below is a summary of the variables relevant to the analysis. The primary source of the data description can also be found here. “track_popularity” is one of the main variables of interest, so its summary statistics are provided. No variables are removed in the EDA process; hence a description of every variable is provided.

# reading data dictionary file
data_dict <- read.csv(file.path(getwd(), "dat_dict.csv"))

knitr::kable(data_dict, align = "lccrr")

variable	class	description
track_id	character	Song Unique ID
track_name	character	Song Name
track_artist	character	Song Artist
track_popularity	double	Song Popularity (0-100) where higher is better
track_album_id	character	Album unique ID
track_album_name	character	Song album name
track_album_release_date	Factor	Date when album released
playlist_name	character	Name of playlist
playlist_genre	character	Playlist genre
playlist_subgenre	character	Playlist subgenre
danceability	num	Danceability describes how suitable a track is for dancing. A value of 0.0 is least danceable and 1.0 is most danceable.
energy	num	Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity.
key	int	The estimated overall key of the track.
loudness	num	The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks.
mode	int	Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness	num	Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
acousticness	num	A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
instrumentalness	num	Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness	num	Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence	num	A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo	num	The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
duration_ms	num	Duration of song in milliseconds

# summary statistics of the variable 'track_popularity'
summary(spotify$track_popularity)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   24.00   45.00   42.48   62.00  100.00

4. Proposed Exploratory Data Analysis

This section contains an outline for the EDA plan (for the final project).

4.1 EDA plan summary

Exploratory Data Analysis would consist of the following steps:

Understanding trends such as the count of artists and tracks by genre - This is to understand whether the songs available on Spotify are skewed in any manner
Feature engineering - Creating new variables to classify the era of album release. This would help identify trends by album era. An analysis would also be done to understand whether the factors that affect track popularity vary by album release era
Understanding the distribution of track_popularity across various genres using kernel density plots:
- Is the popularity distribution of a genre more tightly concentrated around its mean (vs. others)? This would suggest that tracks of this genre have a higher probability (for the same confidence interval) of being popular amongst Spotify users
Understanding the Spotify user base through correlation factor analysis:
- Are danceable tracks more popular on Spotify?
- Are tracks conveying higher positivity (high valence) more popular?
- Do users rate audiobooks / poetry / podcasts higher than songs on Spotify (analysis using the speechiness variable)?
Do we have any common trends in highly popular / less popular artists on Spotify?
- The common music keys used by highly popular album artists (analyzed using the key variable)
- Bag-of-words analysis of music titles / podcast titles within highly popular tracks (is this different from the bag of words obtained from less popular tracks?)
Have the characteristics of a music genre changed over time?
- Understanding changes in key variables (such as danceability, key, loudness) of a genre over different eras of album release
Analysing the presence of a ‘Network Effect’ on the popularity of artists:
- Do highly popular artists have a significantly higher number of tracks than others? If the difference is significant, we can further investigate whether the higher number of tracks increases the visibility of an artist and hence improves their overall popularity
Chi-square test to understand the significance of various variables in determining track popularity
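The feature engineering step in the plan above (classifying tracks by album release era) could look roughly like the following. This is only a sketch: the era boundaries and labels are assumptions chosen for illustration, and the year-only value in the example vector is hypothetical, included to show that the year is extracted regardless of the surrounding date format.

```r
# Example release dates in the month/day/year form seen in the sample rows;
# the year-only entry "1985" is a hypothetical value
release_date <- c("6/14/2019", "1985", "12/13/2019", "7/5/1999", "3/1/1972")

# Pull out the four-digit year regardless of the surrounding format.
# Note: regmatches() drops entries with no match, so every value is
# assumed to contain a four-digit year
release_year <- as.integer(regmatches(release_date,
                                      regexpr("[0-9]{4}", release_date)))

# Assumed era boundaries, chosen only for illustration
era <- cut(release_year,
           breaks = c(-Inf, 1979, 1999, 2009, Inf),
           labels = c("pre-1980", "1980-1999", "2000-2009", "2010 onwards"))

print(data.frame(release_date, release_year, era))
```

The resulting factor can then serve as a grouping variable when comparing popularity or audio features across eras.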

4.2 Types of Plots and Tables

For the EDA steps in section 4.1 the following plot types and tables would be used:

Kernel density plots (to visualize the distribution of a numeric variable grouped by a categorical variable)
Lollipop charts (to visualize category-wise relative ranking in an analysis)
Bar charts (to visualize category-wise relative ranking in an analysis)
Connected scatter plots (to visualize the change in popularity of a genre by release date)
Tree map (to visualize the major categories based on a numeric variable)
Data tables (to represent bag-of-words analyses)
Correlation matrix tables (to visualize correlation factors in an n*n matrix format)
Boxplots (to visualize the distribution of a numeric variable)
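As an example of the correlation matrix table listed above, the sketch below computes an n*n matrix of Pearson correlations with base R. The data are simulated (`toy`, the seed, and the built-in relationship between danceability and popularity are all assumptions), so the numbers say nothing about the real Spotify tracks.

```r
# Simulated stand-in data (hypothetical values, not real tracks)
set.seed(1)
n <- 200
toy <- data.frame(
  danceability = runif(n),
  valence      = runif(n)
)

# By construction, popularity rises with danceability only
toy$track_popularity <- 50 + 20 * toy$danceability + rnorm(n, sd = 5)

# Pairwise Pearson correlations, rounded for display
cor_matrix <- round(cor(toy), 2)
print(cor_matrix)
```

A table like this could then be passed to `knitr::kable()` or visualized as a heatmap.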

4.3 Techniques to be Learnt

I would need to improve my skills in the Natural Language Processing arena of analytics, to deliver more impactful and insightful analyses from the dataset.

4.4 Data Mining Techniques to be Incorporated

Regression techniques to identify key factors that determine track popularity: The analysis would include regression techniques such as Random Forests, Multiple Linear Regression, and Boosting (selected based on best prediction performance)
Text mining techniques: Additionally, Natural Language Processing techniques such as word tokenization, word clouds, and TF-IDF would be employed to understand the common song title words amongst highly popular tracks
Clustering techniques: Furthermore, unsupervised learning techniques such as clustering would be employed to intelligently group tracks having the same characteristics
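To make the clustering item concrete, here is a minimal k-means sketch using base R's `kmeans()`. K-means is just one possible clustering method, and the two simulated groups, the choice of two clusters, and the feature names used in `toy` are assumptions for illustration only.

```r
# Simulated tracks forming two well-separated groups (hypothetical values)
set.seed(7)
toy <- data.frame(
  energy       = c(rnorm(50, mean = 0.2, sd = 0.05),
                   rnorm(50, mean = 0.8, sd = 0.05)),
  acousticness = c(rnorm(50, mean = 0.8, sd = 0.05),
                   rnorm(50, mean = 0.2, sd = 0.05))
)

# Scale features so that each contributes equally to the distance
toy_scaled <- scale(toy)

# k-means with two clusters; nstart reruns the algorithm from several
# random starts and keeps the best solution
km <- kmeans(toy_scaled, centers = 2, nstart = 10)

# Attach the cluster label to each track
toy$cluster <- km$cluster
print(table(toy$cluster))
```

On the real data the number of clusters would need to be chosen, for example by comparing the within-cluster sum of squares across several values of `centers`.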

References

https://www.statista.com/statistics/813876/spotify-monthly-active-users-time-spent-listening/

https://qz.com/1736762/spotify-grows-monthly-active-users-and-turns-profit-shares-jump-15-percent/

https://www.tunefab.com/spotify/spotify-bluetooth.html

Analyzing Tracks on Spotify

BANA 7025 Data Wrangling in R - Midterm Project

Anjali Shalimar ( M13473173 )

4/3/2020
