Introduction

Background

With over 320 million monthly users and a catalog of 60 million tracks, four billion playlists and 1.9 million podcasts, Spotify is one of the most popular streaming platforms for music (and, increasingly, talk content). Like its big tech rivals and partners, Spotify owes much of its success to data and analytics. By collecting and analyzing massive amounts of listener data, Spotify can identify emerging user trends in real time and rapidly develop new features or services to capitalize on them. One of Spotify’s major competitive advantages is its formidable recommendation engine. Using machine learning (ML) algorithms, natural language processing (NLP) and convolutional neural networks (CNNs), Spotify transforms historical listening data into personalized playlists and music recommendations.

For this project, we are interested in how track popularity is influenced by other attributes such as danceability, loudness, speechiness and valence.


Analytical Methodology

The plan is to analyze the relationship between popularity and the other features of a song, to perform cluster analysis with the K-means method to get an idea of song genres, and to use a random forest to predict song popularity.
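
The clustering step is described in this report but its code is not shown. A minimal sketch of how K-means could be run on scaled audio features is given here. It uses a small synthetic data frame (`toy_songs`, hypothetical) in place of the real data, and the choice of six clusters (one per playlist genre) is an assumption.

```r
# Minimal k-means sketch on synthetic data; toy_songs and k = 6 are assumptions
set.seed(2021)
toy_songs <- data.frame(
  energy       = runif(300),
  danceability = runif(300),
  valence      = runif(300),
  loudness     = runif(300, min = -20, max = 0)
)

# Standardise so that loudness (in dB) does not dominate the 0-1 features
scaled <- scale(toy_songs)

# nstart > 1 reruns the algorithm from several random starts and keeps the best fit
km <- kmeans(scaled, centers = 6, nstart = 25)

table(km$cluster)                  # cluster sizes
round(km$betweenss / km$totss, 2)  # share of total variance explained by the clusters
```

On the real data, the cluster labels could then be cross-tabulated against playlist_genre to see whether the clusters line up with genres.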


Benefit of Analysis

This analysis is mainly useful for marketing to Spotify users and improving their experience on the platform. It can also help predict the popularity of a new song from its attributes well before it reaches the market.

Packages Required

The following packages were used:

  • tidyverse - Provides functionality to model, transform and visualize data

  • ggplot2 - Used for plotting charts

  • plotly - For interactive, web-based charts via the open source JavaScript graphing library plotly.js

  • corrplot - For displaying correlation matrices and confidence intervals

  • factoextra - To visualize the output of multivariate data analysis, including clustering results

  • funModeling - Toolbox for exploratory data analysis and data preparation

  • RColorBrewer - To help choose sensible colour schemes for figures in R

  • lubridate - Eases working with date and time data types

  • knitr - Integrates R code into R Markdown; we used it to display the variables in a neat, scrollable tabular format

  • DT - Renders R data objects as HTML tables

  • cowplot - Provides additional functionality for ggplot2

  • vtable - To print the summary statistics of the data

  • cluster - To use clustering algorithms

  • purrr - Functional programming tools for R

  • randomForest - To run the Random Forest algorithm

library(tidyverse)
library(ggplot2)
library(plotly)
library(corrplot)
library(factoextra)
library(knitr)
library(RColorBrewer)
library(funModeling)
library(lubridate)
library(DT)
library(cowplot)
library(vtable)
library(cluster)
library(purrr)
library(randomForest)

Data Dictionary

Each row represents one song, and each column contains an attribute of that song. The attributes are as follows:

  • track_id: Unique ID of the song

  • track_name: Title / Name of the song

  • track_artist: Name of the artist

  • track_popularity: Popularity of the track from 0 to 100, based on the number of plays

  • track_album_release_date: Release date of the album the song appears on

  • track_album_name: Name of the album the song appears on

  • playlist_name: Name of the playlist the song is in

  • playlist_genre: Genre of the playlist the song is in

  • acousticness: Measure of how acoustic the track is, ranging from 0.0 to 1.0

  • danceability: Describes how suitable a track is for dancing. Values range from 0.0 (least danceable) to 1.0 (most danceable).

  • duration_ms: Duration of the track in milliseconds (ms); converted to minutes during data preparation

  • energy: Measure from 0.0 to 1.0 that represents a perceptual measure of intensity and activity, i.e. the energy of the song

  • instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Values range from 0.0 to 1.0, and values closer to 1.0 indicate a greater likelihood that the track has no vocal content.

  • speechiness: Detects the presence of spoken words in a track. Values above 0.66 describe tracks that are probably made entirely of spoken words, such as a podcast or talk show; values between 0.33 and 0.66 describe tracks that may contain both music and speech; values below 0.33 most likely represent music.

  • valence: Measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive, while tracks with low valence sound more negative.

  • key: Estimated overall key of the track. If key is not detected, the value is -1.

  • liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.

  • loudness: Overall loudness of a track in decibels (dB). Values typically range between -60 and 0 dB.

  • mode: Mode indicates the modality (major or minor) of a track. Major is represented by 1 and minor is represented by 0.

  • tempo: Overall estimated tempo of a track in beats per minute (BPM).
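
As an illustration of how the speechiness ranges in the list above can be applied, the following sketch bins a few made-up speechiness values into three bands. The cut points 0.33 and 0.66 follow Spotify's documented ranges; the band labels and the example values are assumptions for illustration only.

```r
# Hypothetical speechiness values; the labels are illustrative, not part of the dataset
speechiness <- c(0.04, 0.12, 0.35, 0.50, 0.70, 0.92)

# Intervals are [0, 0.33], (0.33, 0.66], (0.66, 1]
band <- cut(speechiness,
            breaks = c(0, 0.33, 0.66, 1),
            labels = c("music", "mixed", "spoken"),
            include.lowest = TRUE)

table(band)
```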

Data Preparation

This section contains all the steps we followed to prepare the data for analysis. Each step is explained alongside its code.

Data Source

The dataset used for this project is the Spotify Genre dataset, which was provided as part of the course curriculum.


Data Loading

#### Reading Data 
spotify_songs <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')

### Checking dimension of Data
dim(spotify_songs)
## [1] 32833    23

The original dataset has 32,833 rows and 23 columns. Its genres were selected using Every Noise, a visualization of the Spotify genre space maintained by a genre taxonomist. The dataset includes roughly 5,000 songs for each genre, split across various sub-genres. The main purpose of the original dataset was to explore the following audio features:

  • Confidence Measures: Acousticness, Liveness, Speechiness, Instrumentalness
  • Perceptual Measures: Energy, Loudness, Danceability and Valence
  • Descriptors: Duration, Tempo, Key and Mode

The dataset consists of the following variables:

#### Checking column name
names(spotify_songs)
##  [1] "track_id"                 "track_name"              
##  [3] "track_artist"             "track_popularity"        
##  [5] "track_album_id"           "track_album_name"        
##  [7] "track_album_release_date" "playlist_name"           
##  [9] "playlist_id"              "playlist_genre"          
## [11] "playlist_subgenre"        "danceability"            
## [13] "energy"                   "key"                     
## [15] "loudness"                 "mode"                    
## [17] "speechiness"              "acousticness"            
## [19] "instrumentalness"         "liveness"                
## [21] "valence"                  "tempo"                   
## [23] "duration_ms"

Data Cleaning

Step 1: Handling Missing and Empty Values

#### Counting NA values in every column
colSums(is.na(spotify_songs))
##                 track_id               track_name             track_artist 
##                        0                        5                        5 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        5 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0
#### Removing NA values from the data 
spotify_songs <- na.omit(spotify_songs)

The track_name, track_album_name and track_artist variables each contain 5 missing values. We removed these rows because they would hamper our analysis. Only 5 rows were omitted in total, so the impact on the insights derived from the dataset is negligible.
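
Note that na.omit drops a row if any column is missing. A minimal sketch of a more targeted alternative, which only requires chosen key columns to be complete, is shown here on a made-up data frame (`toy` and `key_cols` are hypothetical names):

```r
# Toy data: row 2 and row 3 miss a key column, row 4 misses only tempo
toy <- data.frame(
  track_name   = c("a", NA, "c", "d"),
  track_artist = c("x", "y", NA, "z"),
  tempo        = c(120, 98, 110, NA)
)

# Keep rows where the key columns are present, regardless of other columns
key_cols <- c("track_name", "track_artist")
cleaned <- toy[complete.cases(toy[, key_cols]), ]

nrow(cleaned)       # rows 1 and 4 are kept
nrow(na.omit(toy))  # na.omit would keep row 1 only
```

In this dataset the two approaches give the same result, since only the three name columns contain missing values.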

Step 2: Checking the structure and changing datatypes of certain variables

#### checking Structure of the data 
str(spotify_songs)
## tibble [32,828 x 23] (S3: tbl_df/tbl/data.frame)
##  $ track_id                : chr [1:32828] "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
##  $ track_name              : chr [1:32828] "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr [1:32828] "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : num [1:32828] 66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : chr [1:32828] "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
##  $ track_album_name        : chr [1:32828] "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr [1:32828] "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
##  $ playlist_name           : chr [1:32828] "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_id             : chr [1:32828] "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
##  $ playlist_genre          : chr [1:32828] "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr [1:32828] "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num [1:32828] 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num [1:32828] 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : num [1:32828] 6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num [1:32828] -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : num [1:32828] 1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num [1:32828] 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num [1:32828] 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num [1:32828] 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num [1:32828] 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num [1:32828] 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num [1:32828] 122 100 124 122 124 ...
##  $ duration_ms             : num [1:32828] 194754 162600 176616 169093 189052 ...
##  - attr(*, "na.action")= 'omit' Named int [1:5] 8152 9283 9284 19569 19812
##   ..- attr(*, "names")= chr [1:5] "8152" "9283" "9284" "19569" ...
#### checking Summary of the data 
summary(spotify_songs)
##    track_id          track_name        track_artist       track_popularity
##  Length:32828       Length:32828       Length:32828       Min.   :  0.00  
##  Class :character   Class :character   Class :character   1st Qu.: 24.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 45.00  
##                                                           Mean   : 42.48  
##                                                           3rd Qu.: 62.00  
##                                                           Max.   :100.00  
##  track_album_id     track_album_name   track_album_release_date
##  Length:32828       Length:32828       Length:32828            
##  Class :character   Class :character   Class :character        
##  Mode  :character   Mode  :character   Mode  :character        
##                                                                
##                                                                
##                                                                
##  playlist_name      playlist_id        playlist_genre     playlist_subgenre 
##  Length:32828       Length:32828       Length:32828       Length:32828      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   danceability        energy              key            loudness      
##  Min.   :0.0000   Min.   :0.000175   Min.   : 0.000   Min.   :-46.448  
##  1st Qu.:0.5630   1st Qu.:0.581000   1st Qu.: 2.000   1st Qu.: -8.171  
##  Median :0.6720   Median :0.721000   Median : 6.000   Median : -6.166  
##  Mean   :0.6549   Mean   :0.698603   Mean   : 5.374   Mean   : -6.720  
##  3rd Qu.:0.7610   3rd Qu.:0.840000   3rd Qu.: 9.000   3rd Qu.: -4.645  
##  Max.   :0.9830   Max.   :1.000000   Max.   :11.000   Max.   :  1.275  
##       mode         speechiness      acousticness    instrumentalness   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000000  
##  1st Qu.:0.0000   1st Qu.:0.0410   1st Qu.:0.0151   1st Qu.:0.0000000  
##  Median :1.0000   Median :0.0625   Median :0.0804   Median :0.0000161  
##  Mean   :0.5657   Mean   :0.1071   Mean   :0.1754   Mean   :0.0847599  
##  3rd Qu.:1.0000   3rd Qu.:0.1320   3rd Qu.:0.2550   3rd Qu.:0.0048300  
##  Max.   :1.0000   Max.   :0.9180   Max.   :0.9940   Max.   :0.9940000  
##     liveness         valence           tempo         duration_ms    
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Min.   :  4000  
##  1st Qu.:0.0927   1st Qu.:0.3310   1st Qu.: 99.96   1st Qu.:187805  
##  Median :0.1270   Median :0.5120   Median :121.98   Median :216000  
##  Mean   :0.1902   Mean   :0.5106   Mean   :120.88   Mean   :225797  
##  3rd Qu.:0.2480   3rd Qu.:0.6930   3rd Qu.:133.92   3rd Qu.:253581  
##  Max.   :0.9960   Max.   :0.9910   Max.   :239.44   Max.   :517810
#### Changing datatype of some columns
spotify_songs <- spotify_songs %>%
  mutate(playlist_genre = as.factor(playlist_genre),
         playlist_subgenre = as.factor(playlist_subgenre),
         mode = as.factor(mode),
         key = as.factor(key))

Step 3: Removing Duplicates

#### removing duplicated data 
spotify_songs <- spotify_songs[!duplicated(spotify_songs$track_id),]
dim(spotify_songs)
## [1] 28352    23
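
The same track can appear in several playlists, which is why track_id repeats. The sketch below, on a made-up data frame, shows how the duplicates could be counted first and makes explicit that `!duplicated()` keeps the first occurrence, so the retained playlist_genre is simply whichever one appears first.

```r
# Toy data: t1 and t2 each appear in two playlists
toy <- data.frame(
  track_id       = c("t1", "t2", "t1", "t3", "t2"),
  playlist_genre = c("pop", "rock", "edm", "rap", "pop"),
  stringsAsFactors = FALSE
)

n_dupes <- sum(duplicated(toy$track_id))     # number of repeated rows
deduped <- toy[!duplicated(toy$track_id), ]  # keeps the first row per track_id

n_dupes
deduped$playlist_genre
```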

Step 4: Extracting Year and Song Duration in minutes.

We want to analyze how the data trends by artist and genre over the release years. We therefore split track_album_release_date into year, month and day.

#### Extracting Year from songs
spotify_songs <- spotify_songs %>%
separate(track_album_release_date,
c("year","month","day"),
sep = "-") 

#### Creating minutes from duration
spotify_songs<-spotify_songs %>% 
  mutate(duration_min=duration_ms/60000)

#### changing data type of year column
spotify_songs$year <- as.numeric(spotify_songs$year)
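
If some release dates contain only a year or a year and month (an assumption about the data, not verified here), separate() fills the missing pieces with NA and emits a warning. When only the year is needed, a simpler sketch is to take the first four characters:

```r
# Hypothetical release dates with full, year-only and year-month precision
dates <- c("2019-06-14", "2012", "1998-03")

# The year is always the first four characters, whatever the precision
year <- as.integer(substr(dates, 1, 4))
year
```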

Step 5: Selecting the required columns from the dataset

#### Dropping unnecessary columns
spotify_songs <- spotify_songs %>% select(-c(track_id,track_album_id,playlist_id))

Data Preview

A preview of the clean dataset is given below:

### displaying top 100 rows
output_data <- head(spotify_songs, n = 100)
datatable(output_data, filter = 'top', options = list(pageLength = 25))

Exploratory Data Analysis

Correlation between variables

To understand the correlation among the variables, we use the corrplot function from the corrplot package.

### Correlation plot of numeric columns
songs_corr <- spotify_songs %>%
select(track_popularity,danceability,energy,loudness,speechiness,acousticness,instrumentalness, liveness, valence, tempo)
par(bg="#121212")
corrplot(cor(songs_corr),method = 'pie',type="lower",bg="#121212",col="#1DB954",tl.col="#1DB954",addgrid.col = "#1DB954")

Based on the plot, popularity does not have a strong correlation with any of the other track features. However, quite a few variables are strongly correlated with each other. This indicates multicollinearity, so these variables might not all be suitable as inputs to classification algorithms.
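
To complement the visual check, the correlation matrix can also be inspected numerically. The sketch below uses synthetic data (`toy`, with loudness deliberately built from energy) to rank correlations with popularity and to list strongly correlated feature pairs; the 0.7 cutoff is an arbitrary assumption.

```r
# Synthetic data: loudness is constructed to correlate strongly with energy
set.seed(1)
toy <- data.frame(track_popularity = rnorm(200), energy = rnorm(200))
toy$loudness <- 0.8 * toy$energy + rnorm(200, sd = 0.3)

cors <- cor(toy)

# Correlation of each feature with popularity, strongest first
pop_cor <- sort(cors["track_popularity", colnames(cors) != "track_popularity"],
                decreasing = TRUE)
round(pop_cor, 2)

# Feature pairs with |r| above the cutoff (upper triangle avoids duplicates)
high <- which(abs(cors) > 0.7 & upper.tri(cors), arr.ind = TRUE)
data.frame(var1 = rownames(cors)[high[, "row"]],
           var2 = colnames(cors)[high[, "col"]],
           r    = round(cors[high], 2))
```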


Density Plots of Variables

Let’s see how energy, danceability, valence, acousticness, speechiness and liveness are distributed over all the observations in our dataset. We plot the density of all six variables together, as they are all on the same scale and range from 0 to 1.

#### Plotting Density Plots
ggplot(spotify_songs) +
  geom_density(aes(energy, fill ="energy"), alpha = 0.1) + 
  geom_density(aes(danceability, fill ="danceability"), alpha = 0.1) + 
  geom_density(aes(valence, fill ="valence"), alpha = 0.1) + 
  geom_density(aes(acousticness, fill ="acousticness"), alpha = 0.1) + 
  geom_density(aes(speechiness, fill ="speechiness"), alpha = 0.1) + 
  geom_density(aes(liveness, fill ="liveness"), alpha = 0.1) + 
  scale_x_continuous(name = "Energy, Danceability, Valence, Acousticness, Speechiness and Liveness") +
  scale_y_continuous(name = "Density") +
  ggtitle("Density plot of Energy, Danceability, Valence, Acousticness, Speechiness and Liveness") +
  theme_bw() +
  theme(plot.title = element_text(size = 10, face = "bold", colour = "#1DB954"),
        text = element_text(size = 10,colour = "#1DB954")) +
  theme(legend.title=element_blank()) +
  scale_fill_brewer(palette="Accent") + 
  theme(panel.background = element_rect(fill = "#121212")) +
  theme(plot.background = element_rect(fill = "#121212")) +
  theme(legend.background = element_rect(fill = "#121212"))+
  theme(axis.text.x = element_text(colour = "darkgreen"))+
  theme(axis.text.y = element_text(colour = "darkgreen"))


Box Plot

Genre by energy

#### Plotting Box Plot of genre by energy
ggplot(spotify_songs, aes(x=energy, y=playlist_genre)) +
  geom_boxplot(color="white", fill="darkgreen")  +
  scale_x_continuous(name = "Energy") +
  scale_y_discrete(name = "Genre") +
  theme_bw() +
  ggtitle("Variation: Energy and Genre") +
  
  theme(plot.title = element_text(size = 10, face = "bold", colour = "#1DB954"),
        text = element_text(size = 10,colour = "#1DB954")) +
  theme(legend.title=element_blank()) +
  scale_fill_brewer(palette="Accent") + 
  theme(axis.text.x = element_text(colour = "darkgreen"))+
  theme(panel.background = element_rect(fill = "#121212"), 
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  theme(plot.background = element_rect(fill = "#121212")) +
  theme(legend.background = element_rect(fill = "#121212"))+
  theme(axis.text.y = element_text(colour = "darkgreen"))

  • The plot shows that the EDM genre has the songs with the highest energy.

Genre by danceability

#### Plotting Box Plot of genre by danceability
ggplot(spotify_songs, aes(x=danceability, y=playlist_genre)) +
  geom_boxplot(color="white", fill="darkgreen")  +
  scale_x_continuous(name = "Danceability") +
  scale_y_discrete(name = "Genre") +
  theme_bw() +
  ggtitle("Danceability and Genre") +
  
  theme(plot.title = element_text(size = 10, face = "bold", colour = "#1DB954"),
        text = element_text(size = 10,colour = "#1DB954")) +
  theme(legend.title=element_blank()) +
  scale_fill_brewer(palette="Accent") + 
  theme(axis.text.x = element_text(colour = "darkgreen"))+
  theme(panel.background = element_rect(fill = "#121212"), 
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  theme(plot.background = element_rect(fill = "#121212")) +
  theme(legend.background = element_rect(fill = "#121212"))+
  theme(axis.text.y = element_text(colour = "darkgreen"))

  • As seen in the graph, the rap genre has the highest danceability.

Genre by liveness

#### Plotting Box Plot of genre by liveness
ggplot(spotify_songs, aes(x=liveness, y=playlist_genre)) +
  geom_boxplot(color="white", fill="darkgreen")  +
  scale_x_continuous(name = "Liveness") +
  scale_y_discrete(name = "Genre") +
  theme_bw() +
  ggtitle("Liveness and Genre") +
  
  theme(plot.title = element_text(size = 10, face = "bold", colour = "#1DB954"),
        text = element_text(size = 10,colour = "#1DB954")) +
  theme(legend.title=element_blank()) +
  scale_fill_brewer(palette="Accent") + 
  theme(axis.text.x = element_text(colour = "darkgreen"))+
  theme(panel.background = element_rect(fill = "#121212"), 
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  theme(plot.background = element_rect(fill = "#121212")) +
  theme(legend.background = element_rect(fill = "#121212"))+
  theme(axis.text.y = element_text(colour = "darkgreen"))

  • EDM songs appear to have the highest liveness values, followed closely by the rock genre.

Genre by valence

#### Plotting Box Plot of genre by valence
ggplot(spotify_songs, aes(x=valence, y=playlist_genre)) +
  geom_boxplot(color="white", fill="darkgreen")  +
  scale_x_continuous(name = "Valence") +
  scale_y_discrete(name = "Genre") +
  theme_bw() +
  ggtitle("Valence and Genre") +
  
  theme(plot.title = element_text(size = 10, face = "bold", colour = "#1DB954"),
        text = element_text(size = 10,colour = "#1DB954")) +
  theme(legend.title=element_blank()) +
  scale_fill_brewer(palette="Accent") + 
  theme(axis.text.x = element_text(colour = "darkgreen"))+
  theme(panel.background = element_rect(fill = "#121212"), 
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  theme(plot.background = element_rect(fill = "#121212")) +
  theme(legend.background = element_rect(fill = "#121212"))+
  theme(axis.text.y = element_text(colour = "darkgreen"))

  • As seen above, the Latin genre has a higher valence than the others.

Genre by loudness

#### Plotting Box Plot of genre by loudness
ggplot(spotify_songs, aes(x=loudness, y=playlist_genre)) +
  geom_boxplot(color="white", fill="darkgreen")  +
  scale_x_continuous(name = "Loudness") +
  scale_y_discrete(name = "Genre") +
  theme_bw() +
  ggtitle("Loudness and Genre") +
  
  theme(plot.title = element_text(size = 10, face = "bold", colour = "#1DB954"),
        text = element_text(size = 10,colour = "#1DB954")) +
  theme(legend.title=element_blank()) +
  scale_fill_brewer(palette="Accent") + 
  theme(axis.text.x = element_text(colour = "darkgreen"))+
  theme(panel.background = element_rect(fill = "#121212"), 
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  theme(plot.background = element_rect(fill = "#121212")) +
  theme(legend.background = element_rect(fill = "#121212"))+
  theme(axis.text.y = element_text(colour = "darkgreen"))

  • Loudness is fairly similar across genres; only songs in the EDM genre are slightly louder than the rest.
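
The five box plots above repeat the same styling code. One way to reduce the repetition is sketched below; `theme_spotify` and `genre_boxplot` are hypothetical helper names, and the toy data frame stands in for spotify_songs.

```r
library(ggplot2)

# Hypothetical helper: the dark styling shared by the plots in this report
theme_spotify <- function() {
  theme_bw() +
    theme(
      plot.title        = element_text(size = 10, face = "bold", colour = "#1DB954"),
      text              = element_text(size = 10, colour = "#1DB954"),
      axis.text         = element_text(colour = "darkgreen"),
      panel.background  = element_rect(fill = "#121212"),
      plot.background   = element_rect(fill = "#121212"),
      legend.background = element_rect(fill = "#121212"),
      panel.grid.major  = element_blank(),
      panel.grid.minor  = element_blank(),
      legend.title      = element_blank()
    )
}

# Hypothetical helper: horizontal box plot of one feature by playlist genre
genre_boxplot <- function(data, feature, label) {
  ggplot(data, aes(x = .data[[feature]], y = playlist_genre)) +
    geom_boxplot(color = "white", fill = "darkgreen") +
    labs(x = label, y = "Genre", title = paste(label, "and Genre")) +
    theme_spotify()
}

# Toy data in place of spotify_songs
toy <- data.frame(playlist_genre = rep(c("edm", "rock"), each = 20),
                  energy = runif(40))
p <- genre_boxplot(toy, "energy", "Energy")
```

With the real data, each plot would then reduce to a single call such as genre_boxplot(spotify_songs, "valence", "Valence").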

Energy Distribution of the songs

#### Histogram of Energy Distribution
spotify_songs$energy_only <- cut(spotify_songs$energy, breaks = 10)
spotify_songs %>%
  ggplot( aes(x = energy_only )) +
  geom_bar(width = 0.2, fill = "#1DB954", colour = "black") +
  scale_x_discrete(name = "Energy") +
    theme(plot.title = element_text(size = 10, face = "bold", colour = "#1DB954"),
       text = element_text(size = 10,colour = "darkgreen")) +
  theme(axis.text.x = element_text(colour = "#1DB954"))+
  theme(panel.background = element_rect(fill = "#121212"), 
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  theme(plot.background = element_rect(fill = "#121212")) +
  theme(legend.background = element_rect(fill = "#121212"))+
  theme(axis.text.y = element_text(colour = "darkgreen"))

  • This plot shows that most songs in the dataset have high energy: more than half of the tracks fall between 0.6 and 0.9.

Speechiness Distribution of the songs

#### Histogram of Speechiness Distribution
spotify_songs$speech_only <- cut(spotify_songs$speechiness, breaks = 10)
spotify_songs %>%
  ggplot( aes(x = speech_only )) +  
  geom_bar(width = 0.2,  fill = "#1DB954", colour = "black") +
  scale_x_discrete(name = "Speechiness") +
    theme(axis.text.x = element_text(colour = "darkgreen"))+
    theme(plot.title = element_text(size = 10, face = "bold", colour = "#1DB954"),
       text = element_text(size = 10,colour = "#1DB954")) +
  theme(panel.background = element_rect(fill = "#121212"), 
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  theme(plot.background = element_rect(fill = "#121212")) +
  theme(legend.background = element_rect(fill = "#121212"))+
  theme(axis.text.y = element_text(colour = "darkgreen"))+
  coord_flip()

  • This plot shows that the large majority of songs have low speechiness: about 65% of the tracks fall in the lowest bin (below roughly 0.09).

Trend Analysis by Year

#### Plotting the yearly average of a feature for songs released after 2010
trend_chart <- function(arg){
  # Average the chosen feature per release year
  # (replaces the deprecated funs() helper from dplyr)
  trend_change <- spotify_songs %>%
    filter(year > 2010) %>%
    group_by(year) %>%
    summarize(Average = mean(.data[[arg]]))

  chart <- ggplot(data = trend_change, aes(x = year, y = Average)) +
    geom_line(color = "#1DB954", size = 1) +
    scale_x_continuous(breaks = seq(2011, 2020, 1)) +
    scale_y_continuous(name = arg) +
    theme(axis.text.x = element_text(colour = "darkgreen")) +
    theme(plot.title = element_text(size = 10, face = "bold", colour = "#1DB954"),
          text = element_text(size = 10, colour = "#1DB954")) +
    theme(panel.background = element_rect(fill = "#121212"),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank()) +
    theme(plot.background = element_rect(fill = "#121212")) +
    theme(legend.background = element_rect(fill = "#121212")) +
    theme(axis.text.y = element_text(colour = "darkgreen"))
  return(chart)
}

trend_chart_track_popularity<-trend_chart("track_popularity")
trend_chart_danceability<-trend_chart("danceability")
trend_chart_energy<-trend_chart("energy")
trend_chart_loudness<-trend_chart("loudness")
trend_chart_duration_min<-trend_chart("duration_min")
trend_chart_speechiness<-trend_chart("speechiness")



plot_grid(trend_chart_track_popularity, trend_chart_danceability, trend_chart_energy, trend_chart_loudness, trend_chart_duration_min, trend_chart_speechiness,ncol = 2, label_size = 1)

  • To see how the features change over time, we group the songs by release year, compute the yearly average of each feature and plot it.

  • What interests us most is that track duration shows a continuous decreasing trend, i.e. songs are getting shorter each year.


Summary Statistics of the clean data

### Summary statistics of all the variables available in the data
st(spotify_songs)
Summary Statistics

Variable                      N      Mean        Std. Dev.  Min      Pctl. 25   Pctl. 75   Max
track_popularity              28352  39.335      23.699     0        21         58         100
year                          28352  2011.054    11.23      1957     2008       2019       2020
playlist_genre                28352
… edm                         4877   17.2%
… latin                       4136   14.6%
… pop                         5132   18.1%
… r&b                         4504   15.9%
… rap                         5398   19%
… rock                        4305   15.2%
playlist_subgenre             28352
… album rock                  1039   3.7%
… big room                    1034   3.6%
… classic rock                1100   3.9%
… dance pop                   1298   4.6%
… electro house               1416   5%
… electropop                  1251   4.4%
… gangster rap                1314   4.6%
… hard rock                   1202   4.2%
… hip hop                     1296   4.6%
… hip pop                     803    2.8%
… indie poptimism             1547   5.5%
… latin hip hop               1194   4.2%
… latin pop                   1097   3.9%
… neo soul                    1478   5.2%
… new jack swing              1036   3.7%
… permanent wave              964    3.4%
… pop edm                     967    3.4%
… post-teen pop               1036   3.7%
… progressive electro house   1460   5.1%
… reggaeton                   687    2.4%
… southern hip hop            1582   5.6%
… trap                        1206   4.3%
… tropical                    1158   4.1%
… urban contemporary          1187   4.2%
danceability                  28352  0.653       0.146      0        0.561      0.76       0.983
energy                        28352  0.698       0.184      0        0.579      0.843      1
key                           28352
… 0                           3001   10.6%
… 1                           3436   12.1%
… 2                           2478   8.7%
… 3                           797    2.8%
… 4                           1925   6.8%
… 5                           2301   8.1%
… 6                           2261   8%
… 7                           2907   10.3%
… 8                           2066   7.3%
… 9                           2631   9.3%
… 10                          1972   7%
… 11                          2577   9.1%
loudness                      28352  -6.818      3.036      -46.448  -8.31      -4.709     1.275
mode                          28352
… 0                           12318  43.4%
… 1                           16034  56.6%
speechiness                   28352  0.108       0.103      0        0.041      0.133      0.918
acousticness                  28352  0.177       0.223      0        0.014      0.26       0.994
instrumentalness              28352  0.091       0.233      0        0          0.007      0.994
liveness                      28352  0.191       0.156      0        0.093      0.249      0.996
valence                       28352  0.51        0.234      0        0.329      0.695      0.991
tempo                         28352  120.958     26.955     0        99.972     133.999    239.44
duration_ms                   28352  226574.631  61081.364  4000     187741.25  254975.25  517810
duration_min                  28352  3.776       1.018      0.067    3.129      4.25       8.63
energy_only                   28352
… (-0.000825,0.1]             67     0.2%
… (0.1,0.2]                   225    0.8%
… (0.2,0.3]                   518    1.8%
… (0.3,0.4]                   1168   4.1%
… (0.4,0.5]                   2368   8.4%
… (0.5,0.6]                   3627   12.8%
… (0.6,0.7]                   4964   17.5%
… (0.7,0.8]                   5797   20.4%
… (0.8,0.9]                   5763   20.3%
… (0.9,1]                     3855   13.6%
speech_only                   28352
… (-0.000918,0.0918]          18385  64.8%
… (0.0918,0.184]              4869   17.2%
… (0.184,0.275]               2459   8.7%
… (0.275,0.367]               1648   5.8%
… (0.367,0.459]               736    2.6%
… (0.459,0.551]               174    0.6%
… (0.551,0.643]               53     0.2%
… (0.643,0.734]               12     0%
… (0.734,0.826]               8      0%
… (0.826,0.919]               8      0%

Random Forest

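In this section, we fit a Random Forest to the Spotify dataset to predict track popularity. Songs with a popularity score above 62 are labelled popular (1) and the rest non-popular (0). The cutoff of 62 is the third quartile of track_popularity before duplicates were removed; in the deduplicated data the 75th percentile is 58, so fewer than a quarter of the songs are labelled popular (about 18% of the training set).

# Selecting the relevant data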
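
Rather than hard-coding the cutoff, it could be derived from the data being modelled. A minimal sketch on made-up popularity scores (the `popularity` vector is synthetic, and the 0.75 probability is the top-quartile rule described above):

```r
# Synthetic popularity scores standing in for track_popularity
set.seed(2021)
popularity <- sample(0:100, 1000, replace = TRUE)

# Third quartile of the scores actually used for modelling
cutoff <- quantile(popularity, probs = 0.75, names = FALSE)

# 1 = popular (strictly above the cutoff), 0 = not popular
is_popular <- as.integer(popularity > cutoff)

cutoff
mean(is_popular)  # share labelled popular; at most 0.25 by construction
```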
set.seed(2021)

part<-sample(1:nrow(spotify_songs), nrow(spotify_songs)*.75)

data_for_popularity_analysis <- spotify_songs %>% 
  select(c('energy', 'liveness','tempo', 'speechiness', 'acousticness',
           'instrumentalness', 'danceability', 'duration_min' ,
           'loudness','valence' ,'track_popularity','key','mode','playlist_genre')) %>%
  mutate( track_popularity = if_else(track_popularity > 62 , 1,0))  
                  
#Splitting data in train and Test
spotify_songs_train<- data_for_popularity_analysis[part,]
spotify_songs_test <- data_for_popularity_analysis[-part,]

# running random Forest 
spotify_rand <- randomForest(as.factor(track_popularity)~., data=spotify_songs_train, mtry= 4,importance =TRUE )
spotify_rand
## 
## Call:
##  randomForest(formula = as.factor(track_popularity) ~ ., data = spotify_songs_train,      mtry = 4, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 18.75%
## Confusion matrix:
##       0   1 class.error
## 0 16947 445  0.02558648
## 1  3543 329  0.91503099
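  • The OOB error rate is 18.75%, i.e. the out-of-bag accuracy is 81.25%. The per-class errors differ sharply: about 2.6% for non-popular songs but about 91.5% for popular songs, so the model rarely identifies a popular song correctly.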
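
The class error for popular songs is high because the two classes are imbalanced. One common mitigation, not used in this report, is to draw a balanced sample for each tree through the strata and sampsize arguments of randomForest. The sketch below runs on synthetic data; `toy`, `x1`, `x2` and the class balance are assumptions, and whether this helps on the Spotify data is untested.

```r
library(randomForest)

# Synthetic imbalanced data: roughly one in five rows is labelled 1
set.seed(2021)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$popular <- factor(ifelse(toy$x1 + rnorm(n, sd = 0.5) > 1, 1, 0))

# Each tree samples the same number of rows from both classes
n_minority <- min(table(toy$popular))
rf_balanced <- randomForest(popular ~ ., data = toy, ntree = 200,
                            strata = toy$popular,
                            sampsize = c(n_minority, n_minority))

rf_balanced$confusion  # OOB confusion matrix with per-class error
```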

Analysis of Random Forest Model

Calculating Confusion Matrix

#Predicting songs popularity on test data
predict_test <- predict(spotify_rand, spotify_songs_test[,-11])
table(spotify_songs_test$track_popularity, predict_test)
##    predict_test
##        0    1
##   0 5687  123
##   1 1172  106
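
The test-set confusion matrix above can be summarised with a few standard metrics. The sketch below re-enters the four reported counts by hand (rows are actual classes, columns are predictions) and computes accuracy, plus recall and precision for the popular class.

```r
# Counts copied from the confusion matrix above (filled column by column)
cm <- matrix(c(5687, 1172, 123, 106), nrow = 2,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

accuracy  <- sum(diag(cm)) / sum(cm)          # share of all songs classified correctly
recall    <- cm["1", "1"] / sum(cm["1", ])    # share of popular songs that were found
precision <- cm["1", "1"] / sum(cm[, "1"])    # share of "popular" predictions that were right

round(c(accuracy = accuracy, recall = recall, precision = precision), 3)
# accuracy 0.817, recall 0.083, precision 0.463
```

The overall accuracy is close to the share of non-popular songs in the test set, so most of it comes from predicting the majority class.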

Plotting Variable Importance

### Calculating variable importance (mean decrease in accuracy)
important <- importance(spotify_rand)
varImportance <- data.frame(Variables = row.names(important),
                           Importance = round(important[,"MeanDecreaseAccuracy"],2))
rankImportance <- varImportance%>%
      mutate(Rank= paste('#',dense_rank(desc(Importance))))

ggplot(rankImportance,aes(x=reorder(Variables,Importance) ,y=Importance,fill=Importance))+ 
geom_bar(stat = "identity",fill = "#1DB954") +
geom_text(aes(x = Variables, y = 0.5, label = Rank),hjust=0, vjust=0.55, size = 4, colour = "black") +
  theme_bw() +
  ggtitle("Variable Importance") +
    scale_x_discrete(name = "Variables") +
  theme(axis.text.x = element_text(colour = "darkgreen"))+
  theme(plot.title = element_text(size = 10, face = "bold", colour = "#1DB954"),
        text = element_text(size = 10,colour = "#1DB954")) +
  theme(panel.background = element_rect(fill = "#121212"), 
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  theme(plot.background = element_rect(fill = "#121212")) +
  theme(legend.background = element_rect(fill = "#121212"))+
  theme(axis.text.y = element_text(colour = "darkgreen")) +
coord_flip()

Conclusion

A common assumption is that energy influences popularity, i.e. that energetic songs are more popular. However, we could not find any correlation between popularity and energy. The songs in the top 100 were also not evenly distributed across genres.

We split the track_album_release_date variable into year, month and day, and created new variables such as duration_min. Our goal was to explain track popularity using the different features of a song. To find the trends, we used the charts trend_chart_track_popularity, trend_chart_danceability, trend_chart_energy, trend_chart_loudness, trend_chart_duration_min and trend_chart_speechiness.

What interests us most is that track duration shows a continuous decreasing trend, meaning that songs are getting shorter each year. Furthermore, the danceability of tracks is continuously rising, which suggests that people are enjoying danceable songs. Energy and loudness follow almost the same trend each year, which suggests a high positive correlation between them: both features peak and dip in the same years.

The average popularity of songs reached its lowest value of the last decade in 2014 and has been increasing continuously since, suggesting that, on average, songs are becoming more popular over time.

We used trend charts to find out how the features change over time. To understand the correlation among variables, we used the corrplot function in R. We used box plots to compare the feature distributions across genres and to spot outliers.

Even though millions of songs exist, we had only about 32k records for our analysis, so we could not obtain a full picture of the features of music. The analysis could also be strengthened by incorporating user-related features such as demographic attributes and listening history. We used the k-means clustering algorithm to try to recover the respective genre of the songs in each cluster but, perhaps because of the size of the data, we could not obtain clear clusters. We also tried to predict a track’s popularity from its key features using Random Forest. The results indicate that the algorithm is fairly good at classifying non-popular songs but much weaker at identifying popular ones. Playlist genre and loudness are the two major factors contributing to a song’s popularity score.