Presented by - Daya Nayak, Shubham Jha, Maulik Patel

Analysis of Songs on Spotify

Introduction

Background

Spotify is a Sweden-based audio streaming and media services provider that launched in October 2008. It is now one of the biggest digital music, podcast, and streaming services in the world, giving access to millions of songs from artists all over the world. It has also started producing music albums and events across the world.

Spotify offers over 70 million tracks, and 2 million podcasts to more than 300 million monthly users. As a freemium service, basic features are free with advertisements and limited control, while additional features, such as offline listening and commercial-free listening, are offered via paid subscriptions. Users can search for music based on track, artist, album, or genre, and can create, edit & share playlists.

Objective & Proposed Analytical Methodology

The scope of the project is to determine whether the popularity of a song depends on its genre/sub-genre and on audio features such as loudness, speechiness, and danceability. This analysis can provide insight into which features make a song popular and can also support music recommendation. We would also like to find groups among the songs based on their features. This would in turn help users create playlists by suggesting songs similar to a selected song.

We plan to explore the following aspects in the data analysis:

  • Identify relationships between music features (both categorical and numerical).
  • Identify each genre’s features and how Spotify classifies genres.
  • Determine whether certain factors make a song more or less popular.
  • Analyze the trend of each feature over the years.
  • Find correlations among the features.
  • Examine how different features affect a song’s popularity.

What do we hope to achieve with the analysis?

These insights will help us build a model that predicts the popularity score of a song from its features. Such a model is useful when composing music, because it shows which factors weigh heavily on popularity and therefore which factors artists should focus on to maximise their chance of popularity.

We also aim to build a simple recommender system that suggests similar songs based on the user’s preferences and listening habits, which helps with discovering new songs and creating playlists.

Packages Used

The following packages are used in the analysis:

  • tidyverse - Collection of R packages for data manipulation, exploration and visualization.

  • ggplot2 - A system for declaratively creating graphics, based on The Grammar of Graphics; used for plotting charts.

  • plotly - Creates interactive web-based charts via the open source JavaScript graphing library plotly.js.

  • factoextra - Visualizes the output of multivariate data analyses.

  • funModeling - Toolbox for exploratory data analysis and data preparation.

  • RColorBrewer - Helps choose sensible colour schemes for figures in R.

  • lubridate - Eases working with date and time data types.

  • knitr - Enables integration of R code into LaTeX, LyX, HTML, Markdown, AsciiDoc, and reStructuredText documents.

  • DT - Renders R data objects as HTML tables.

  • cowplot - Provides additional functionality for ggplot2.

  • wordcloud - Creates word clouds.

  • tm - Text mining; used to build and clean the corpus of track names for the word cloud.

  • gridExtra - Arranges multiple plots on one grid (grid.arrange).

  • corrplot - Visualizes correlation matrices, to find collinearity between different features.

  • kableExtra - Allows users to construct complex tables and customize styles using a readable syntax.

  • imager - Loads images, including from a publicly available URL.
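
Before running the analysis it can help to confirm that the packages this report relies on are installed. The following is a minimal sketch, not part of the original code: the vector name pkgs and the decision to only report (rather than install) missing packages are assumptions made for illustration.

```r
# Packages used in this report. tm and gridExtra supply Corpus()/tm_map()
# and grid.arrange(), which the analysis code calls.
pkgs <- c("tidyverse", "ggplot2", "plotly", "factoextra", "funModeling",
          "RColorBrewer", "lubridate", "knitr", "DT", "cowplot", "wordcloud",
          "corrplot", "kableExtra", "imager", "tm", "gridExtra")

# requireNamespace() returns TRUE/FALSE without attaching the package
available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that would still need install.packages()
missing_pkgs <- names(available)[!available]
print(missing_pkgs)
```

Any name printed here can be passed to install.packages() before the packages are attached with library().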

Data Preparation

Data Source

The data comes from Spotify via the spotifyr package. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get either your own data or general metadata around songs from Spotify’s API.

A subset of the data has already been extracted and is available on GitHub; this analysis is based on that subset. The dataset lists songs together with their popularity, artists, and albums, drawn from 6 main genres (EDM, Latin, Pop, R&B, Rap, and Rock) and released between Jan 1957 and Jan 2020.

Reading the Data from the source file

spotify_songs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')

Data Dictionary

The dataset comprises 32,833 song records (with 23 variables) on Spotify from 1957-2020 and was updated on 1/21/2020. This Medium article does a great job of explaining the various “audio features” that Spotify links to a song.

Here is the data dictionary as a reference:

Preliminary Data Cleaning and Summary

1) Analysing the original data-set
### Checking dimension of Data
dim(spotify_songs)
## [1] 32833    23
### Checking structure of Data
str(spotify_songs)
## spec_tbl_df [32,833 x 23] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ track_id                : chr [1:32833] "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
##  $ track_name              : chr [1:32833] "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr [1:32833] "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : num [1:32833] 66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : chr [1:32833] "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
##  $ track_album_name        : chr [1:32833] "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr [1:32833] "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
##  $ playlist_name           : chr [1:32833] "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_id             : chr [1:32833] "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
##  $ playlist_genre          : chr [1:32833] "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr [1:32833] "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num [1:32833] 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num [1:32833] 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : num [1:32833] 6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num [1:32833] -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : num [1:32833] 1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num [1:32833] 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num [1:32833] 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num [1:32833] 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num [1:32833] 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num [1:32833] 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num [1:32833] 122 100 124 122 124 ...
##  $ duration_ms             : num [1:32833] 194754 162600 176616 169093 189052 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   track_id = col_character(),
##   ..   track_name = col_character(),
##   ..   track_artist = col_character(),
##   ..   track_popularity = col_double(),
##   ..   track_album_id = col_character(),
##   ..   track_album_name = col_character(),
##   ..   track_album_release_date = col_character(),
##   ..   playlist_name = col_character(),
##   ..   playlist_id = col_character(),
##   ..   playlist_genre = col_character(),
##   ..   playlist_subgenre = col_character(),
##   ..   danceability = col_double(),
##   ..   energy = col_double(),
##   ..   key = col_double(),
##   ..   loudness = col_double(),
##   ..   mode = col_double(),
##   ..   speechiness = col_double(),
##   ..   acousticness = col_double(),
##   ..   instrumentalness = col_double(),
##   ..   liveness = col_double(),
##   ..   valence = col_double(),
##   ..   tempo = col_double(),
##   ..   duration_ms = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
### summarising the Data set and features
summary(spotify_songs)
##    track_id          track_name        track_artist       track_popularity
##  Length:32833       Length:32833       Length:32833       Min.   :  0.00  
##  Class :character   Class :character   Class :character   1st Qu.: 24.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 45.00  
##                                                           Mean   : 42.48  
##                                                           3rd Qu.: 62.00  
##                                                           Max.   :100.00  
##  track_album_id     track_album_name   track_album_release_date
##  Length:32833       Length:32833       Length:32833            
##  Class :character   Class :character   Class :character        
##  Mode  :character   Mode  :character   Mode  :character        
##                                                                
##                                                                
##                                                                
##  playlist_name      playlist_id        playlist_genre     playlist_subgenre 
##  Length:32833       Length:32833       Length:32833       Length:32833      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   danceability        energy              key            loudness      
##  Min.   :0.0000   Min.   :0.000175   Min.   : 0.000   Min.   :-46.448  
##  1st Qu.:0.5630   1st Qu.:0.581000   1st Qu.: 2.000   1st Qu.: -8.171  
##  Median :0.6720   Median :0.721000   Median : 6.000   Median : -6.166  
##  Mean   :0.6548   Mean   :0.698619   Mean   : 5.374   Mean   : -6.720  
##  3rd Qu.:0.7610   3rd Qu.:0.840000   3rd Qu.: 9.000   3rd Qu.: -4.645  
##  Max.   :0.9830   Max.   :1.000000   Max.   :11.000   Max.   :  1.275  
##       mode         speechiness      acousticness    instrumentalness   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000000  
##  1st Qu.:0.0000   1st Qu.:0.0410   1st Qu.:0.0151   1st Qu.:0.0000000  
##  Median :1.0000   Median :0.0625   Median :0.0804   Median :0.0000161  
##  Mean   :0.5657   Mean   :0.1071   Mean   :0.1753   Mean   :0.0847472  
##  3rd Qu.:1.0000   3rd Qu.:0.1320   3rd Qu.:0.2550   3rd Qu.:0.0048300  
##  Max.   :1.0000   Max.   :0.9180   Max.   :0.9940   Max.   :0.9940000  
##     liveness         valence           tempo         duration_ms    
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Min.   :  4000  
##  1st Qu.:0.0927   1st Qu.:0.3310   1st Qu.: 99.96   1st Qu.:187819  
##  Median :0.1270   Median :0.5120   Median :121.98   Median :216000  
##  Mean   :0.1902   Mean   :0.5106   Mean   :120.88   Mean   :225800  
##  3rd Qu.:0.2480   3rd Qu.:0.6930   3rd Qu.:133.92   3rd Qu.:253585  
##  Max.   :0.9960   Max.   :0.9910   Max.   :239.44   Max.   :517810
### checking for NULLs or missing values 
colSums(is.na(spotify_songs))
##                 track_id               track_name             track_artist 
##                        0                        5                        5 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        5 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0
2) Cleaning the Data Set

Null values: The track_name, track_album_name and track_artist variables each contain 5 missing values. Out of 32,833 rows, we can remove these 5 rows without any significant impact on our data.

Duplicate data: Some songs appear more than once in this dataset. Out of 32,833 rows, only 28,352 songs are unique. The repeated rows have the same ‘track_id’ but a different ‘playlist_id’, so we need to remove the duplicates. Since ‘track_id’ is the unique identifier of a song and the song’s other numeric and categorical features remain the same, we delete the duplicated songs based on ‘track_id’.
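
To make the de-duplication rule concrete, here is a minimal sketch on toy data (the IDs and values are made up for illustration). It shows that duplicated() flags every occurrence of a track_id after the first, so the first playlist entry of each track is the one that is kept.

```r
# Toy data: track "a" appears in two playlists (hypothetical IDs)
toy <- data.frame(
  track_id    = c("a", "b", "a", "c"),
  playlist_id = c("p1", "p1", "p2", "p2"),
  energy      = c(0.9, 0.5, 0.9, 0.7),
  stringsAsFactors = FALSE
)

# duplicated() marks every occurrence of a track_id after the first one
dup_flag <- duplicated(toy$track_id)

# Keep one row per track: the first row in which it appears
toy_unique <- toy[!dup_flag, ]
print(toy_unique)
```

Because the first occurrence wins, the playlist-level columns of the surviving row depend on the row order of the input.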

#### Removing NULL values from the data 
spotify_songs <- na.omit(spotify_songs)

#### Changing the datatype of some categorical columns from string to factor.
#### These columns take only a few distinct values (unlike, e.g., song names),
#### and factors are the standard type for categorical data analysis
spotify_songs <- spotify_songs %>%
  mutate(playlist_genre = as.factor(playlist_genre),
         playlist_subgenre = as.factor(playlist_subgenre),
         mode = as.factor(mode),
         key = as.factor(key))

#### removing duplicated data 
spotify_songs <- spotify_songs[!duplicated(spotify_songs$track_id),]
dim(spotify_songs)
## [1] 28352    23
### summarising the Data set and features
summary(spotify_songs)
##    track_id          track_name        track_artist       track_popularity
##  Length:28352       Length:28352       Length:28352       Min.   :  0.00  
##  Class :character   Class :character   Class :character   1st Qu.: 21.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 42.00  
##                                                           Mean   : 39.34  
##                                                           3rd Qu.: 58.00  
##                                                           Max.   :100.00  
##                                                                           
##  track_album_id     track_album_name   track_album_release_date
##  Length:28352       Length:28352       Length:28352            
##  Class :character   Class :character   Class :character        
##  Mode  :character   Mode  :character   Mode  :character        
##                                                                
##                                                                
##                                                                
##                                                                
##  playlist_name      playlist_id        playlist_genre
##  Length:28352       Length:28352       edm  :4877    
##  Class :character   Class :character   latin:4136    
##  Mode  :character   Mode  :character   pop  :5132    
##                                        r&b  :4504    
##                                        rap  :5398    
##                                        rock :4305    
##                                                      
##                  playlist_subgenre  danceability        energy        
##  southern hip hop         : 1582   Min.   :0.0000   Min.   :0.000175  
##  indie poptimism          : 1547   1st Qu.:0.5610   1st Qu.:0.579000  
##  neo soul                 : 1478   Median :0.6700   Median :0.722000  
##  progressive electro house: 1460   Mean   :0.6534   Mean   :0.698372  
##  electro house            : 1416   3rd Qu.:0.7600   3rd Qu.:0.843000  
##  gangster rap             : 1314   Max.   :0.9830   Max.   :1.000000  
##  (Other)                  :19555                                      
##       key           loudness       mode       speechiness      acousticness   
##  1      : 3436   Min.   :-46.448   0:12318   Min.   :0.0000   Min.   :0.0000  
##  0      : 3001   1st Qu.: -8.310   1:16034   1st Qu.:0.0410   1st Qu.:0.0143  
##  7      : 2907   Median : -6.261             Median :0.0626   Median :0.0797  
##  9      : 2631   Mean   : -6.818             Mean   :0.1079   Mean   :0.1772  
##  11     : 2577   3rd Qu.: -4.709             3rd Qu.:0.1330   3rd Qu.:0.2600  
##  2      : 2478   Max.   :  1.275             Max.   :0.9180   Max.   :0.9940  
##  (Other):11322                                                                
##  instrumentalness       liveness         valence           tempo       
##  Min.   :0.0000000   Min.   :0.0000   Min.   :0.0000   Min.   :  0.00  
##  1st Qu.:0.0000000   1st Qu.:0.0926   1st Qu.:0.3290   1st Qu.: 99.97  
##  Median :0.0000207   Median :0.1270   Median :0.5120   Median :121.99  
##  Mean   :0.0911294   Mean   :0.1910   Mean   :0.5104   Mean   :120.96  
##  3rd Qu.:0.0065725   3rd Qu.:0.2490   3rd Qu.:0.6950   3rd Qu.:134.00  
##  Max.   :0.9940000   Max.   :0.9960   Max.   :0.9910   Max.   :239.44  
##                                                                        
##   duration_ms    
##  Min.   :  4000  
##  1st Qu.:187741  
##  Median :216933  
##  Mean   :226575  
##  3rd Qu.:254975  
##  Max.   :517810  
## 

Redundant columns: Now that there are no duplicate records, and since we want to analyze which features influence ‘track_popularity’, we can drop the following columns, which are not useful in our analysis:

  • track_id
  • track_album_id
  • track_album_name
  • playlist_id
  • playlist_name
  • playlist_subgenre
#### Dropping Redundant Columns
spotify_songs <- spotify_songs %>% select(-c(track_id, track_album_id,
                                             track_album_name, 
                                             playlist_id, playlist_name,
                                             playlist_subgenre))

Data Manipulation :

#### track_album_release_date is stored as text that starts with the four-digit year
#### (e.g. "2019-06-14"), so the first four characters give the release year
spotify_songs$year <- substr(spotify_songs$track_album_release_date, 1, 4)

#### changing data type of year column
spotify_songs$year <- as.numeric(spotify_songs$year)


### Checking structure of Data
str(spotify_songs)
## tibble [28,352 x 18] (S3: tbl_df/tbl/data.frame)
##  $ track_name              : chr [1:28352] "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr [1:28352] "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : num [1:28352] 66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_release_date: chr [1:28352] "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
##  $ playlist_genre          : Factor w/ 6 levels "edm","latin",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ danceability            : num [1:28352] 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num [1:28352] 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : Factor w/ 12 levels "0","1","2","3",..: 7 12 2 8 2 9 6 5 9 3 ...
##  $ loudness                : num [1:28352] -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : Factor w/ 2 levels "0","1": 2 2 1 2 2 2 1 1 2 2 ...
##  $ speechiness             : num [1:28352] 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num [1:28352] 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num [1:28352] 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num [1:28352] 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num [1:28352] 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num [1:28352] 122 100 124 122 124 ...
##  $ duration_ms             : num [1:28352] 194754 162600 176616 169093 189052 ...
##  $ year                    : num [1:28352] 2019 2019 2019 2019 2019 ...
##  - attr(*, "na.action")= 'omit' Named int [1:5] 8152 9283 9284 19569 19812
##   ..- attr(*, "names")= chr [1:5] "8152" "9283" "9284" "19569" ...
3) Cleaned Data set

A preview of the clean data-set is given below:

### displaying top 100 rows
output_data <- head(spotify_songs, n = 100)
datatable(output_data, filter = 'top', options = list(pageLength = 25))

Exploratory Data Analysis

1) Summary by Genre:

Let’s see the distribution of songs across different genres. Which genre has the most songs in the dataset?

# songs per genre
spotify_songs %>% group_by(Genre = playlist_genre) %>%
  summarise(No_of_tracks = n()) %>% 
  arrange(desc(No_of_tracks)) %>% knitr::kable()
Genre No_of_tracks
rap 5398
pop 5132
edm 4877
r&b 4504
rock 4305
latin 4136

Rap is the genre with the most songs in the dataset, followed by pop and then EDM.

Let’s look at Artists with most track releases:
# artists with most releases
highest_tracks <- spotify_songs %>% group_by(Artist = track_artist) %>%
  summarise(No_of_tracks = n()) %>%
  arrange(desc(No_of_tracks)) %>%
  top_n(15, wt = No_of_tracks) %>% 
  ggplot(aes(x = Artist, y = No_of_tracks)) +
        geom_bar(stat = "identity") +
        coord_flip() + labs(title = "Artists With The Most Track Releases", x = "Artist", y = "# of Tracks")

ggplotly(highest_tracks)

Queen (130 tracks), Martin Garrix (87 tracks), and Don Omar (84 tracks) are among the artists with the most track releases across the years. The top 15 artists are shown here.

Most frequent track names:
#Create a vector containing only the text
name <- spotify_songs$track_name 
# Create a corpus  
corpus_ <- Corpus(VectorSource(name))

#clean text data - remove numbers, punctuation, extra whitespace, stopwords and remix/version labels
corpus_ <- corpus_ %>%
        tm_map(removeNumbers) %>%
        tm_map(removePunctuation) %>%
        tm_map(stripWhitespace)
corpus_ <- tm_map(corpus_, content_transformer(tolower))
corpus_ <- tm_map(corpus_, removeWords, stopwords("english"))
corpus_ <- tm_map(corpus_, removeWords,c("feat","edit","remix","remastered","remaster","radio","version","original","mix"))

#create a term-document matrix (terms in rows, track names in columns)

dtm <- TermDocumentMatrix(corpus_) 
dtm_matrix <- as.matrix(dtm) 
words <- sort(rowSums(dtm_matrix),decreasing=TRUE) 
df <- data.frame(word = names(words),freq=words)

#generate the word cloud
wordcloud(words = df$word, freq = df$freq,scale=c(8,0.25), min.freq = 1,
          max.words=150, random.order=FALSE, rot.per=0.25, 
          colors=brewer.pal(8, "Dark2"))

“Love” is the most frequently used word in song titles, followed by “don’t” and “like”.

Distribution of song popularity across genres
#popularity among genres
popularity_vs_genre_plot<- ggplot(spotify_songs, aes(x = playlist_genre, y =
                                                 track_popularity)) +
        geom_boxplot() +
        coord_flip() +
        labs(title = "Popularity across genres", x = "Genres", y = "Popularity")

ggplotly(popularity_vs_genre_plot)

Based on the median values, pop is the most popular genre. It is closely followed by Latin and rap.
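
The ranking read off the box plot can also be computed directly. The sketch below uses made-up popularity scores, not the Spotify data, and base R's tapply() instead of the dplyr pipeline used elsewhere in this report.

```r
# Toy popularity scores per genre (hypothetical values)
toy <- data.frame(
  playlist_genre   = c("pop", "pop", "pop", "rock", "rock", "rap", "rap", "rap"),
  track_popularity = c(70, 55, 60, 40, 50, 65, 30, 50)
)

# Median popularity within each genre
genre_median <- tapply(toy$track_popularity, toy$playlist_genre, median)

# Order genres from most to least popular by median
genre_median <- sort(genre_median, decreasing = TRUE)
print(genre_median)
```

The median is used rather than the mean because it is the centre line of each box in the plot and is less affected by the many tracks with a popularity of 0.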

Songs released over the years
# grouping tracks by years

tracks_year <- spotify_songs %>% 
  select(year) %>%
  filter(year<2020) %>%
  group_by(year) %>%
  summarise(count = n()) 

#plot of tracks released across the years

tracks_vs_year <- ggplot(tracks_year,aes(x = year, y = count,group = 1)) + 
  geom_line() +
  theme(legend.position = "none",axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "Release of songs across years", x = "Year", 
       y = "No of songs released")

ggplotly(tracks_vs_year)

We see that almost 75% of the songs were released in the 21st century. The advent of the internet and of audio streaming services may have caused this drastic increase in the production of songs in the 21st century.
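
A share like this is the mean of a logical condition on the year column. The sketch below uses a handful of made-up years, and treating 2000 as the first year of the 21st century is an assumption made for illustration.

```r
# Toy release years (hypothetical values)
years <- c(1975, 1988, 1999, 2003, 2010, 2015, 2018, 2019)

# TRUE counts as 1 and FALSE as 0, so the mean is the share of TRUE values
share_21st <- mean(years >= 2000)
print(share_21st)
```

On the real data the same expression would be applied to the year column of spotify_songs.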

Popularity of genres over the decades

Has the same genre stayed popular over the years, or has people’s taste in music changed?

## Find popular genres over the decades
spotify_1 <- spotify_songs %>%
  select(track_popularity, year, playlist_genre) %>%
  mutate(year = as.numeric(year), decade = year - (year %% 10))

spotify_2 <- spotify_1 %>%
  filter(track_popularity > 50) %>%
  group_by(decade, playlist_genre) %>%
  summarise(count = n())

decadewise_tracks_genre <- spotify_2 %>% 
  group_by(decade) %>%
  ggplot(aes(fill = playlist_genre, x = decade, y = count)) +
  geom_bar(position= "stack", stat = "identity") +
  labs(title = "Popular genre over the decades", x = "Decade", y = "No of tracks with popularity > 50")

ggplotly(decadewise_tracks_genre)

Rock music was quite popular in the 1960s and 1970s, whereas pop songs are the most popular in the 2010s. This shows a drastic change in people’s choice of songs between the 1960s-70s and the 2010s.
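
The decade binning behind this chart is simple integer arithmetic: subtracting year %% 10 rounds a year down to the start of its decade. The sketch below reproduces the steps on toy rows; the dates, genres and scores are made up, and the shorter date strings are included only to show that taking the first four characters still yields the year.

```r
# Toy rows (hypothetical values); some dates are shorter than YYYY-MM-DD
toy <- data.frame(
  release_date     = c("1969-09-26", "1975", "2012-05", "2017-01-06", "2019-06-14"),
  playlist_genre   = c("rock", "rock", "pop", "pop", "rap"),
  track_popularity = c(72, 64, 48, 81, 66),
  stringsAsFactors = FALSE
)

# Year = first four characters; decade = year rounded down to a multiple of 10
toy$year   <- as.numeric(substr(toy$release_date, 1, 4))
toy$decade <- toy$year - toy$year %% 10

# Keep tracks with popularity above 50, then count them per decade and genre
popular <- toy[toy$track_popularity > 50, ]
counts  <- table(popular$decade, popular$playlist_genre)
print(counts)
```

Each cell of the table is the height of one coloured segment in the stacked bar chart.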

2) Feature Analysis:

Correlation between features (numeric) - using the corrplot function
##### Correlation plot for numeric columns
corr_spotify  <- spotify_songs %>%
select(track_popularity, danceability, energy, loudness, speechiness, 
                   acousticness, instrumentalness, liveness, valence, tempo, duration_ms)

corrplot(cor(corr_spotify),type="lower")



We can observe from the correlation matrix that loudness and energy have a moderate-to-strong positive correlation. Similarly, acousticness and energy, and acousticness and loudness, are negatively correlated. All other feature pairs show little linear correlation.
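
Reading the strongest pairs off a plot can be cross-checked numerically. The sketch below builds synthetic features in which loudness rises and acousticness falls with energy (the coefficients are arbitrary assumptions, not estimates from the Spotify data) and then lists the feature pairs ordered by absolute correlation.

```r
set.seed(1)
n <- 200

# Synthetic features: loudness rises with energy, acousticness falls with it
energy       <- runif(n)
loudness     <- -20 + 15 * energy + rnorm(n, sd = 2)
acousticness <- 1 - energy + rnorm(n, sd = 0.2)
tempo        <- rnorm(n, mean = 120, sd = 25)
toy <- data.frame(energy, loudness, acousticness, tempo)

# Pearson correlation matrix, as passed to corrplot()
corr_mat <- cor(toy)

# Collect each pair once (lower triangle) and sort by absolute correlation
idx <- which(lower.tri(corr_mat), arr.ind = TRUE)
pair_tbl <- data.frame(
  feature_1 = rownames(corr_mat)[idx[, "row"]],
  feature_2 = colnames(corr_mat)[idx[, "col"]],
  r         = corr_mat[idx]
)
pair_tbl <- pair_tbl[order(-abs(pair_tbl$r)), ]
print(pair_tbl)
```

Sorting by the absolute value keeps strong negative relationships, such as acousticness against energy, at the top alongside the strong positive ones.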

Correlation within genres

How do the genres correlate with each other? We will calculate the median feature values of each genre and then compute correlation between them to find out.

# median features by genre
avg_feature_genre <- spotify_songs %>%
  group_by(playlist_genre) %>%
  summarise_if(is.numeric, median, na.rm = TRUE) %>%
  ungroup() 

avg_genre_cor <- avg_feature_genre %>%
  select(track_popularity, danceability, energy, loudness, speechiness, 
         acousticness, instrumentalness, liveness, valence, tempo, duration_ms) %>% 
  scale() %>%
  t() %>%
  as.matrix() %>%
  cor() 

colnames(avg_genre_cor) <- avg_feature_genre$playlist_genre
row.names(avg_genre_cor) <- avg_feature_genre$playlist_genre

avg_genre_cor %>% corrplot::corrplot(method = 'color', 
                                     order = 'hclust',
                                     type = 'upper',
                                     tl.col = 'black',
                                     diag = FALSE,
                                     addCoef.col = "grey40",
                                     number.cex = 0.75,
                                     col = colorRampPalette(colors = c(
                                       'red', 
                                       'white', 
                                       'darkblue'))(200),
                                     mar = c(2,2,2,2),
                                     main = 'Correlation Between Median Genre
                                     Feature Values',
                                     family = 'Avenir')

We observe that EDM and Rock are negatively correlated with all other genres except each other. Latin and R&B are the most similar to each other, with a correlation of 0.37, while EDM and R&B are the most different, with a correlation of -0.69.
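
The scale, transpose, correlate sequence used for this plot can be seen in miniature below. The median values are made up for illustration and only four features are used, so the resulting numbers say nothing about the real genres.

```r
# Hypothetical median feature values, one row per genre
genre_medians <- data.frame(
  danceability = c(0.66, 0.73, 0.65, 0.52),
  energy       = c(0.83, 0.73, 0.70, 0.77),
  acousticness = c(0.01, 0.15, 0.10, 0.03),
  valence      = c(0.37, 0.63, 0.50, 0.53),
  row.names    = c("edm", "latin", "pop", "rock")
)

# Standardise each feature across genres so that no feature dominates by its units
scaled <- scale(genre_medians)

# cor() correlates columns, so transpose to put genres in the columns
genre_cor <- cor(t(scaled))
print(round(genre_cor, 2))
```

Each correlation here is computed over only as many points as there are features, so with the eleven features used in this report the coefficients are rough indicators of similarity rather than precise estimates.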

Variation of popularity with song features
# measuring variation trend of popularity with song/track features

song_features <- c('danceability', 'energy', 'loudness', 'speechiness', 
                   'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms')

popularity_features<- spotify_songs %>%
  select(c('track_popularity', all_of(song_features))) %>%
  group_by(track_popularity) %>%
  summarise_if(is.numeric, mean) %>%
  ungroup()
plot_list <- list()
for (i in seq_along(song_features)) {
  # one scatter plot per feature, with a dashed linear trend line
  plot_list[[i]] <- ggplot(popularity_features, aes_string(x = "track_popularity", 
                                                 y = song_features[i])) + 
    geom_point(shape = 20, color = 3) +
    geom_smooth(method = lm,  linetype = "dashed", color = "blue", se = FALSE) + 
      xlab("Track Popularity")
}
# grid.arrange() comes from the gridExtra package
suppressMessages(do.call(grid.arrange, 
                         c(plot_list, list(top = "variation trend of popularity with Track features"))))

Danceability seems to increase with popularity. Instrumentalness is very low for popular songs. Popular songs are also generally shorter in duration.
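
The direction of each dashed trend line is the sign of a linear-regression slope. The sketch below fits that slope for one feature on synthetic data; the built-in slope of 0.002 and the noise level are assumptions chosen for illustration, not values estimated from the Spotify data.

```r
set.seed(42)

# Synthetic tracks: danceability drifts upward with popularity, plus noise
toy <- data.frame(track_popularity = sample(0:100, 500, replace = TRUE))
toy$danceability <- 0.55 + 0.002 * toy$track_popularity + rnorm(500, sd = 0.1)
toy$danceability <- pmin(pmax(toy$danceability, 0), 1)  # keep within [0, 1]

# Mean feature value at each popularity score, mirroring the plotted points
by_pop <- aggregate(danceability ~ track_popularity, data = toy, FUN = mean)

# Slope of the least-squares line through those means
fit   <- lm(danceability ~ track_popularity, data = by_pop)
slope <- unname(coef(fit)["track_popularity"])
print(slope)
```

A positive slope corresponds to a feature that increases with popularity, and a negative slope to one that decreases.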

3) Key Insights:

OBSERVATIONS:
  • ‘Love’ is the most frequent word in song titles, closely followed by ‘don’t’ and ‘like’.

  • Popular songs are typically just 2.5 - 4 minutes long.

  • Over the years, the duration of songs has also decreased drastically.

  • In the 1960s and 1970s, the ‘Rock’ genre was more popular than the other genres. Over the years, the ‘Pop’ genre has overtaken it in popularity.

  • Instrumentalness is lower than 0.1 in a majority of observations.

  • Valence, Energy and Danceability seem to be approximately normally distributed.

  • Songs with higher speechiness are generally not popular.

Song Recommendation using Clustering

rm(dtm_matrix)


spotify_reduced <- spotify_songs %>% select(-c(track_artist, track_name, track_album_release_date,
playlist_genre, mode))
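# key is a factor: convert via character so the original values 0-11 are kept
# rather than the internal level codes 1-12
spotify_reduced$key <- as.numeric(as.character(spotify_reduced$key))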



spotify_reduced <- as.data.frame(scale(spotify_reduced))

Taking k = 5670 as the optimal number of clusters (about five songs per cluster on average) and creating the clusters

clust <- kmeans(spotify_reduced, 5670)

# final data -
spotify_songs_final <- cbind(spotify_songs, cluster_num = clust$cluster)

Checking goodness of fit for clusters

We know that Total SSE = Within-Cluster SSE + Between-Cluster SSE. If most of the total SSE is captured between clusters, and the within-cluster SSE is minimized, the clustering is a good fit. We check the % of the total SSE captured by the between-cluster SSE:

print(paste0(round(clust$betweenss/clust$totss, 4)*100, "%"))
## [1] "90.99%"
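
The decomposition used here can be checked on a small example. The sketch below clusters two synthetic, well-separated groups (made-up data, not the Spotify features), recomputes the between-cluster sum of squares from the cluster centres, and confirms that it matches the value reported by kmeans().

```r
set.seed(123)

# Two synthetic groups of 50 points each in two dimensions
toy <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2)
)

fit <- kmeans(toy, centers = 2, nstart = 10)

# Between-cluster SS by hand: cluster size times squared distance
# of each cluster centre from the overall mean
grand_mean     <- colMeans(toy)
between_manual <- sum(fit$size * rowSums(sweep(fit$centers, 2, grand_mean)^2))
between_matches <- isTRUE(all.equal(between_manual, fit$betweenss))

# Total SS = total within-cluster SS + between-cluster SS
identity_holds <- isTRUE(all.equal(fit$totss, fit$tot.withinss + fit$betweenss))

# Share of the total SS captured between clusters
ratio <- fit$betweenss / fit$totss
print(paste0(round(ratio * 100, 2), "%"))
```

The fields totss, tot.withinss, betweenss, size and centers are all part of the object returned by kmeans().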

Filtering out songs whose cluster has too few neighbours (clusters of three songs or fewer)

t <- spotify_songs_final %>% group_by(cluster_num) %>% summarise(n = n())


spotify_songs_final_songs <- spotify_songs_final %>% group_by(cluster_num) %>% filter(n()>3)
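
The same size filter can be written in base R, which makes the rule easy to check by eye. The sketch below uses a made-up cluster assignment: only clusters with more than three songs survive.

```r
# Toy cluster assignment (hypothetical values)
toy <- data.frame(
  track_name  = paste("song", 1:9),
  cluster_num = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
  stringsAsFactors = FALSE
)

# ave() repeats each cluster's size on every row of that cluster
cluster_size <- ave(toy$cluster_num, toy$cluster_num, FUN = length)

# Keep only songs whose cluster has more than three members
kept <- toy[cluster_size > 3, ]
print(kept)
```

Here cluster 1 (four songs) is kept, while clusters 2 and 3 (two and three songs) are dropped.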

Song Suggestion

  1. I Don’t Care by Ed Sheeran and Justin Bieber

Sample Case 1: Imagine you are listening to “I Don’t Care” by Justin Bieber and Ed Sheeran, a popular song that is great for casual listening. Let’s see which song suggestions we receive for this particular song.

We see that there are two tracks with the same name; the only difference is their popularity. They are indeed clustered together by our K-means algorithm.

Based on a song’s audio features such as acousticness, danceability, speechiness, and instrumentalness, our algorithm suggests songs that are musically close to “I Don’t Care” by Ed Sheeran. Let’s look at the suggested songs:
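
The code that produces the suggestion tables is not shown in this report. One way such a lookup could work is sketched below: the helper suggest_songs() is hypothetical, and the toy table only borrows the column names track_name, track_artist and cluster_num from the clustered data.

```r
# Hypothetical helper: return the other songs that share a cluster
# with the selected track
suggest_songs <- function(songs, track, artist) {
  hit <- songs$track_name == track & songs$track_artist == artist
  if (!any(hit)) {
    return(songs[0, ])  # unknown track: empty result with the same columns
  }
  clusters <- unique(songs$cluster_num[hit])
  songs[songs$cluster_num %in% clusters & !hit, ]
}

# Toy clustered data (made-up names and cluster numbers)
toy <- data.frame(
  track_name   = c("Song A", "Song B", "Song C", "Song D", "Song E"),
  track_artist = c("Artist 1", "Artist 2", "Artist 3", "Artist 4", "Artist 5"),
  cluster_num  = c(7, 7, 7, 12, 12),
  stringsAsFactors = FALSE
)

suggestions <- suggest_songs(toy, "Song A", "Artist 1")
print(suggestions)
```

Matching on both track name and artist avoids mixing up different songs that share a title, and the selected track itself is excluded from its own suggestions.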

Let’s look at a few more examples.

  2. Jingle Bells by Ella Fitzgerald

And the corresponding suggestions:

  3. Señorita by Shawn Mendes:

And the corresponding suggestions:

  4. South of the Border by Ed Sheeran:

And the corresponding suggestions:

Summary

Good clusters have high internal similarity, i.e. a low Within-Cluster SS, and maximum dissimilarity between clusters, i.e. a high Between SS. In summary, we can measure the Between SS / Total SS ratio; a value close to 1 (100%) means the clustering fits the data well. This holds in our case: the ratio is 91%, which indicates that the clusters are compact relative to the overall spread of the data. Because this ratio rises as the number of clusters grows, it should be read together with the large number of clusters used here.

Limitations

  • K-means clustering is sensitive to outliers.
  • Adding more data points (i.e. more songs) can change the cluster definitions entirely.
  • With more features, the clustering becomes more complex and difficult.