The world of media arts consumption has gone through dramatic changes over the past 30 years. In this era of instant, boundless access to content, everything from the projects that get produced, to the way they are produced, to the user experience has undergone an astounding metamorphosis. The media companies that have risen to the top and stayed there are those that understood early on that user attraction and retention is the single most important pillar of success.
Netflix provides a perfect illustration of this. The company has consistently prioritized refining its original content creation process and its sophisticated recommendation algorithms, even as naysayers resoundingly predicted its demise. After 23 years of consistent quarterly debt raising, negative net income, and poor financial health ratios, the great majority of analysts believed Netflix was in severe crash-and-burn trouble. This year, Netflix made an announcement with major implications for the media industry: it no longer requires further rounds of borrowed capital to keep operations going. This means that, should everything continue as it has, the company is in an excellent position to service its debt, fund its extravagant productions, AND start building tangible profit. The “risk” of putting the user experience first for so many years has literally paid off, and Netflix has taken its position as the streaming audiovisual content powerhouse because of it.
Inspired by this, we have chosen to examine Spotify’s data repository, exploring the song attributes Spotify captures and constructing a scoring measure for grouping songs together as a base for song recommendations.
The relevance of this exploration is two-fold. Spotify’s user experience depends heavily on song curation and on opening multiple avenues through which users can discover new music they will actually like. In that sense, a deep dive into the trends and characteristics that associate certain songs with others will propel Spotify’s ability to generate accurate music recommendations and streamline the decision of where to surface new songs as they are added to the platform.
On the other hand, this study offers valuable insight to those on the music creation side. The plethora of avenues for posting and finding new music leaves up-and-coming artists with little choice but to pay others to devise a marketing strategy, or otherwise spend years without a reliable way to gain listeners. Our study provides a simplified approach to remedy that: a scoring system that lets anyone understand the musical characteristics a given song boils down to, and the type of music associated with those base attributes. A creator can capitalize on this information to find the venues and platforms where associated songs have a presence and take a share of those markets, or to understand how to shape their content to achieve a desired musical association. Insights of this kind are usually guarded by label executives; this study enables the artist to take back some of that control.
The Spotify Web API provides artist, album, and track data, as well as audio features and analysis, all easily accessible via the R package spotifyr.
It’s likely that Spotify uses these features to power products like Spotify Radio and custom playlists like Discover Weekly and Daily Mixes. Spotify has the benefit of letting humans create relationships between songs and weigh in on genre via listening and creating playlists.
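As a quick sketch of that access path (the client ID/secret below are placeholders you would obtain from the Spotify developer dashboard; the track ID is one from our dataset):
library(spotifyr)
#Placeholder credentials from the Spotify developer dashboard
Sys.setenv(SPOTIFY_CLIENT_ID = "your-client-id")
Sys.setenv(SPOTIFY_CLIENT_SECRET = "your-client-secret")
access_token <- get_spotify_access_token()
#Audio features (danceability, energy, valence, ...) for a single track
features <- get_track_audio_features("6f807x0ima9a1j3VPbc7VN")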
library(knitr)      #Used to create a document that mixes text and chunks of code
library(kableExtra) #Used to enhance the aesthetics of table outputs in the document
library(tidyverse)  #Used for faster and easier data manipulation and wrangling
library(dplyr)      #Attached with tidyverse; loaded explicitly for clarity
There are 12 audio features for each track, including confidence measures like acousticness, liveness, speechiness, and instrumentalness; perceptual measures like energy, loudness, danceability, and valence (positiveness); and descriptors like duration, tempo, key, and mode.
A brief description of the variables is given below:
url <- 'https://raw.githubusercontent.com/vpcincin/DataWrangling/main/Data_Dictionary.csv'
spotify_dictionary <- readr::read_csv(url)
kable(spotify_dictionary, format = "simple")
variable | class | description |
---|---|---|
track_id | character | Song unique ID |
track_name | character | Song Name |
track_artist | character | Song Artist |
track_popularity | double | Song Popularity (0-100) where higher is better |
track_album_id | character | Album unique ID |
track_album_name | character | Song album name |
track_album_release_date | character | Date when album released |
playlist_name | character | Name of playlist |
playlist_id | character | Playlist ID |
playlist_genre | character | Playlist genre |
playlist_subgenre | character | Playlist subgenre |
danceability | double | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
energy | double | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
key | double | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. |
loudness | double | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB. |
mode | double | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. |
speechiness | double | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
acousticness | double | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
instrumentalness | double | Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
liveness | double | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
valence | double | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
tempo | double | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
duration_ms | double | Duration of song in milliseconds |
Read the data into a dataframe.
#Creating the "spotify_songs" dataframe below by reading the data from github
spotify_songs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
The data was scraped for 2020-01-21, i.e. week 4 of 2020.
The dataset has 32833 records at a Track-Genre-Artist level and 23 variables.
dim(spotify_songs)
## [1] 32833 23
Surprisingly, there are only 15 null values in the 32833 × 23 dataframe, which is remarkable considering it is a real dataset. Deep-diving into the null values across columns, we see that track_artist, track_name, and track_album_name have 5 null values each.
#getting the count of total null values in data
sum(is.na(spotify_songs))
## [1] 15
#getting null values by columns
colSums(is.na(spotify_songs))
## track_id track_name track_artist
## 0 5 5
## track_popularity track_album_id track_album_name
## 0 0 5
## track_album_release_date playlist_name playlist_id
## 0 0 0
## playlist_genre playlist_subgenre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
We looked at the names of all columns to check whether they are intuitive and follow a uniform naming convention commonly used in R. We decided to use snake case throughout the project.
#printing variable names
names(spotify_songs)
## [1] "track_id" "track_name"
## [3] "track_artist" "track_popularity"
## [5] "track_album_id" "track_album_name"
## [7] "track_album_release_date" "playlist_name"
## [9] "playlist_id" "playlist_genre"
## [11] "playlist_subgenre" "danceability"
## [13] "energy" "key"
## [15] "loudness" "mode"
## [17] "speechiness" "acousticness"
## [19] "instrumentalness" "liveness"
## [21] "valence" "tempo"
## [23] "duration_ms"
The variable names look consistent and easy to interpret, so no renaming is needed.
It is vital to have the correct data type for each column prior to any analysis. Hence, we used str() to observe the data type of each column and changed it wherever necessary. Below are the observations:
#checking variable types for consistency
str(spotify_songs)
## tibble [32,833 x 23] (S3: tbl_df/tbl/data.frame)
## $ track_id : chr [1:32833] "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
## $ track_name : chr [1:32833] "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : chr [1:32833] "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_popularity : num [1:32833] 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : chr [1:32833] "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
## $ track_album_name : chr [1:32833] "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
## $ track_album_release_date: chr [1:32833] "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
## $ playlist_name : chr [1:32833] "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
## $ playlist_id : chr [1:32833] "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
## $ playlist_genre : chr [1:32833] "pop" "pop" "pop" "pop" ...
## $ playlist_subgenre : chr [1:32833] "dance pop" "dance pop" "dance pop" "dance pop" ...
## $ danceability : num [1:32833] 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num [1:32833] 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : num [1:32833] 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num [1:32833] -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : num [1:32833] 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num [1:32833] 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num [1:32833] 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num [1:32833] 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num [1:32833] 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num [1:32833] 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num [1:32833] 122 100 124 122 124 ...
## $ duration_ms : num [1:32833] 194754 162600 176616 169093 189052 ...
Observations:
- mode currently has a numeric type; however, it is a factor/Boolean variable, as it only takes the values {0, 1}.
- track_album_release_date is currently a character column, but it actually holds date values. It is vital to change this, as we will need the column in date format for analyses such as time series plots, Y-o-Y growth, etc.
#Modifying data types
spotify_songs$mode <- as.factor(spotify_songs$mode)
spotify_songs$track_album_release_date <- as.Date(spotify_songs$track_album_release_date)
We can confirm that the data type conversions are now reflected in the data.
#QC step
class(spotify_songs$mode)
## [1] "factor"
class(spotify_songs$track_album_release_date)
## [1] "Date"
We had earlier seen 5 missing values each in track_artist, track_album_name, and track_name. We impute these with the character constant ‘unknown’. These missing values do not pose a serious threat to any analysis we expect to perform, for two primary reasons: a) they are a small fraction of our dataset, and b) we still have plenty of information for these records that we can use in our EDA.
#Missing Value Treatment
spotify_songs$track_artist[is.na(spotify_songs$track_artist)] <- 'unknown'
spotify_songs$track_album_name[is.na(spotify_songs$track_album_name)] <- 'unknown'
spotify_songs$track_name[is.na(spotify_songs$track_name)] <- 'unknown'
The three character columns no longer contain missing values. However, the QC below still reports 1886 NAs: these were most likely introduced by the as.Date() conversion above, since as.Date() returns NA for release dates recorded with only a year (e.g. "2019") rather than a full date. We keep this in mind for any date-based analysis.
#QC step
sum(is.na(spotify_songs))
## [1] 1886
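As a quick follow-up check (a sketch; output not shown here), we can confirm which column holds the remaining NAs:
#Columns that still contain NAs (expected: the converted date column)
na_counts <- colSums(is.na(spotify_songs))
na_counts[na_counts > 0]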
Looking at the summary of the numeric data provides us a high-level understanding of data distribution and centrality. While glancing through the summary, we can also quickly get an idea about which columns to thoroughly inspect for outliers.
#Generating summary
summary(select_if(spotify_songs,is.numeric))
## track_popularity danceability energy key
## Min. : 0.00 Min. :0.0000 Min. :0.000175 Min. : 0.000
## 1st Qu.: 24.00 1st Qu.:0.5630 1st Qu.:0.581000 1st Qu.: 2.000
## Median : 45.00 Median :0.6720 Median :0.721000 Median : 6.000
## Mean : 42.48 Mean :0.6548 Mean :0.698619 Mean : 5.374
## 3rd Qu.: 62.00 3rd Qu.:0.7610 3rd Qu.:0.840000 3rd Qu.: 9.000
## Max. :100.00 Max. :0.9830 Max. :1.000000 Max. :11.000
## loudness speechiness acousticness instrumentalness
## Min. :-46.448 Min. :0.0000 Min. :0.0000 Min. :0.0000000
## 1st Qu.: -8.171 1st Qu.:0.0410 1st Qu.:0.0151 1st Qu.:0.0000000
## Median : -6.166 Median :0.0625 Median :0.0804 Median :0.0000161
## Mean : -6.720 Mean :0.1071 Mean :0.1753 Mean :0.0847472
## 3rd Qu.: -4.645 3rd Qu.:0.1320 3rd Qu.:0.2550 3rd Qu.:0.0048300
## Max. : 1.275 Max. :0.9180 Max. :0.9940 Max. :0.9940000
## liveness valence tempo duration_ms
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 4000
## 1st Qu.:0.0927 1st Qu.:0.3310 1st Qu.: 99.96 1st Qu.:187819
## Median :0.1270 Median :0.5120 Median :121.98 Median :216000
## Mean :0.1902 Mean :0.5106 Mean :120.88 Mean :225800
## 3rd Qu.:0.2480 3rd Qu.:0.6930 3rd Qu.:133.92 3rd Qu.:253585
## Max. :0.9960 Max. :0.9910 Max. :239.44 Max. :517810
From the descriptive statistics of the numeric variables obtained above, we see that for some variables the mean is not very close to the median, which indicates skewness in the data and hints at potential outliers. The direction of the skew also tells us where the outliers lie: right skewness (mean > median) suggests outliers toward the upper bound, while left skewness (mean < median) suggests outliers toward the lower bound.
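As a quick sketch of this idea, we can rank the numeric variables by the gap between their mean and median (the helper below is illustrative, not part of the original pipeline):
#Ranking numeric variables by the gap between mean and median
num_cols <- select_if(spotify_songs, is.numeric)
skew_check <- data.frame(
  mean   = sapply(num_cols, mean,   na.rm = TRUE),
  median = sapply(num_cols, median, na.rm = TRUE)
)
skew_check$gap <- skew_check$mean - skew_check$median
skew_check[order(-abs(skew_check$gap)), ]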
To further check whether these variables have outliers, we plot their distributions using boxplots (in the visual summary section).
For the character variables, we explore the number of levels/distinct values in each variable.
#checking levels for character variables
ulst <- lapply(select_if(spotify_songs, is.character), unique)
k <- lengths(ulst)
k
## track_id track_name track_artist track_album_id
## 28356 23450 10693 22545
## track_album_name playlist_name playlist_id playlist_genre
## 19744 449 471 6
## playlist_subgenre
## 24
Concerns from the observation above: there are only 28356 unique track_id values across 32833 records, so the same track can appear in multiple playlists (and hence in multiple genres/subgenres); likewise, there are 471 playlist_id values but only 449 playlist_name values, so playlist names are not unique. We will need to account for these duplicates when aggregating.
Visual Summary
Box plots will be helpful for outlier detection. In the analysis above, we observed that a few columns have the mean pulled toward one side due to outliers or skewness. Here we check the boxplots of these variables to identify outliers and subsequently devise a strategy to treat them.
#Generating boxplots
boxplot(spotify_songs$danceability, main = 'Boxplot distribution of Danceability')
boxplot(spotify_songs$loudness, main = 'Boxplot distribution of loudness')
boxplot(spotify_songs$tempo , main = 'Boxplot distribution of tempo')
From the boxplot distributions we see that danceability has a single value at 0 that stands apart from the rest of the distribution. Similarly, loudness has one very low value (about -46 dB), and tempo has one value far above and one far below the majority of data points.
We can remove these records to trim the tails of these variables. Since only a couple of records are involved, the data loss is negligible; it is safe to remove them and visualize the distributions again to see the change.
#Trimming Outliers
new_df <- subset(spotify_songs,
                 danceability > min(danceability) &
                 loudness > min(loudness) &
                 tempo > min(tempo) & tempo < max(tempo))
#Visualizing the distributions again
boxplot(new_df$danceability, main = 'Distribution of Danceability')
boxplot(new_df$loudness, main = 'Distribution of loudness')
boxplot(new_df$tempo , main = 'Distribution of tempo')
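A more systematic alternative (a sketch, not applied here) is the 1.5 × IQR fence that boxplot() itself uses to flag points:
#Computing the 1.5 * IQR fences used by boxplot() to flag outliers
iqr_fences <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  unname(q + c(-1.5, 1.5) * IQR(x, na.rm = TRUE))
}
iqr_fences(new_df$loudness)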
Thus, in data cleaning, we checked the variable types, imputed the missing values, examined the numerical summaries, and detected and treated the outliers.
The below table shows a glimpse of the final cleaned dataset.
#printing head
knitr::kable(head(new_df, 5), "simple")
track_id | track_name | track_artist | track_popularity | track_album_id | track_album_name | track_album_release_date | playlist_name | playlist_id | playlist_genre | playlist_subgenre | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6f807x0ima9a1j3VPbc7VN | I Don’t Care (with Justin Bieber) - Loud Luxury Remix | Ed Sheeran | 66 | 2oCs0DGTsRO98Gh5ZSl2Cx | I Don’t Care (with Justin Bieber) [Loud Luxury Remix] | 2019-06-14 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.748 | 0.916 | 6 | -2.634 | 1 | 0.0583 | 0.1020 | 0.00e+00 | 0.0653 | 0.518 | 122.036 | 194754 |
0r7CVbZTWZgbTCYdfa2P31 | Memories - Dillon Francis Remix | Maroon 5 | 67 | 63rPSO264uRjW1X5E6cWv6 | Memories (Dillon Francis Remix) | 2019-12-13 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.726 | 0.815 | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 4.21e-03 | 0.3570 | 0.693 | 99.972 | 162600 |
1z1Hg7Vb0AhHDiEmnDE79l | All the Time - Don Diablo Remix | Zara Larsson | 70 | 1HoSmj2eLcsrR0vE9gThr4 | All the Time (Don Diablo Remix) | 2019-07-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.675 | 0.931 | 1 | -3.432 | 0 | 0.0742 | 0.0794 | 2.33e-05 | 0.1100 | 0.613 | 124.008 | 176616 |
75FpbthrwQmzHlBJLuGdC7 | Call You Mine - Keanu Silva Remix | The Chainsmokers | 60 | 1nqYsOef1yKKuGOVchbsk6 | Call You Mine - The Remixes | 2019-07-19 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.718 | 0.930 | 7 | -3.778 | 1 | 0.1020 | 0.0287 | 9.40e-06 | 0.2040 | 0.277 | 121.956 | 169093 |
1e8PAfcKUYoKkxPhrHqw4x | Someone You Loved - Future Humans Remix | Lewis Capaldi | 69 | 7m7vv9wlQ4i0LFuJiE2zsQ | Someone You Loved (Future Humans Remix) | 2019-03-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.650 | 0.833 | 1 | -4.672 | 1 | 0.0359 | 0.0803 | 0.00e+00 | 0.0833 | 0.725 | 123.976 | 189052 |
Relevant Variables:
Since we are trying to build a recommendation engine to suggest similar songs to the Spotify user, we need track attributes to characterize each track. Below are a few potential variables that could be important for generating a similarity score for the tracks:
track_artist, track_popularity, track_album_name, playlist_genre, playlist_subgenre, danceability, loudness, acousticness, tempo, liveness, duration_ms
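As a small sketch, these candidate variables can be pulled into their own view for later similarity scoring (the name similarity_vars is illustrative):
#Candidate variables for the similarity score
similarity_vars <- new_df %>%
  select(track_artist, track_popularity, track_album_name,
         playlist_genre, playlist_subgenre, danceability,
         loudness, acousticness, tempo, liveness, duration_ms)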
We plot the distributions of the variables of interest for our analysis.
#histogram for variables of interest
hist(new_df$track_popularity, main = 'Distribution of Track Popularity', xlab = 'Track popularity', col = 'light blue')
hist(new_df$danceability, main = 'Distribution of Danceability', xlab = 'Danceability', col = 'light blue')
hist(new_df$loudness, main = 'Distribution of loudness', xlab = 'loudness', col = 'light blue')
The data has track duration in milliseconds, which is too granular for us. Hence, we convert it to minutes and then plot the distribution. As expected, most tracks are between 3 and 5 minutes.
#converting the song duration in minutes
new_df$duration_min <- new_df$duration_ms / 60000
hist(new_df$duration_min, main = 'Distribution of duration in mins', xlab = 'Duration (minutes)', col = 'light blue')
However, we still have a few concerns with the data (noted in the character-variable exploration above) that we need to investigate prior to analysis.
The Spotify data is a rich dataset with playlist information for tracks ranging from 1905 to 2020. To deep dive into the data and look for interesting trends, we can aggregate it at different levels, which gives us a feel for the data distribution at each level. For our EDA, we will consider the following cuts/levels, which may help us answer all or some of the following brainstormed business questions -
To answer all or some of the questions above, we must slice and dice the data and look at it from multiple dimensions. The dataset is at a Track-Artist-Genre level and has various metrics for each track. However, to track how well a certain song did for an artist, we need to compare its performance to other tracks by the same artist/genre – a ‘Relative Metric’. For this, we plan to create aggregated views of the data at the Artist and Genre level (see the sketch after the formulas below) with fields like -:
The following shows a glimpse of the types of views we are considering to answer the key business questions -
We further intend to engineer features for relative track performance by joining the aggregated views with the data set:
Artist_Relative_[Metric] = Metric / Artist_Average_Metric
Genre_Relative_[Metric] = Metric / Genre_Average_Metric
We will add these metrics to the EDA mentioned above to get a better relative understanding of the track and its performance.
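As a hedged sketch of these views and metrics (field names such as artist_avg_popularity and artist_view are illustrative, not final):
#Illustrative artist-level view; the genre-level view follows the same pattern
artist_view <- new_df %>%
  group_by(track_artist) %>%
  summarise(artist_avg_popularity   = mean(track_popularity),
            artist_avg_danceability = mean(danceability),
            artist_track_count      = n())
#Joining back and computing a relative metric, as in the formulas above
new_df_rel <- new_df %>%
  left_join(artist_view, by = "track_artist") %>%
  mutate(artist_relative_popularity = track_popularity / artist_avg_popularity)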
We plan to create a recommendation engine using this dataset. The engine would take as input a list of tracks a user listens to and leverage it to predict ‘similar’ songs the user might like.
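A minimal sketch of the similarity idea (not the final engine; the feature list and the helper recommend_similar() are illustrative assumptions): scale the audio features, then rank tracks by Euclidean distance to a seed track.
#Scale the audio features so no single feature dominates the distance
feature_cols <- c("danceability", "energy", "loudness", "speechiness",
                  "acousticness", "instrumentalness", "liveness", "valence", "tempo")
feat <- scale(new_df[, feature_cols])
recommend_similar <- function(seed_row, n = 5) {
  #Euclidean distance from the seed track to every track
  d <- sqrt(colSums((t(feat) - feat[seed_row, ])^2))
  d[seed_row] <- Inf   #exclude the seed itself
  new_df[order(d)[1:n], c("track_name", "track_artist")]
}
recommend_similar(1)   #five tracks most similar to the first track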