Spotify Data Analysis

Welcome to the world of music

Team members:

Nikita Ahuja
Vinishruth Mocherla

Introduction

If there’s one thing many people can’t live without, it’s music.

Spotify is an international media services provider. The company’s primary business is providing an audio streaming platform, the “Spotify” platform, that provides DRM-restricted music, videos and podcasts from record labels and media companies.

The way Spotify suggests music to listeners has a major influence on their listening habits. The motivation for this project is to enable anyone to discover patterns and insights about the music they listen to. In doing so, they gain a better understanding of their own musical behavior when they listen to songs on Spotify.

Have you ever wondered how Spotify rates the popularity of songs?
Or which factors determine a song’s genre?
Which characteristics of a song determine its popularity?

This analysis aims to answer these questions.

The following tasks are performed:

Correlation between the different variables
Identifying each genre’s features and how Spotify classifies genres
Analyzing features that affect popularity
A predictive model to identify popularity of a song

These tasks are carried out through data preparation, exploratory data analysis and predictive modeling.

Based on our analysis, the consumer will be able to identify which factors influence the popularity of a song on Spotify.

Packages required

The following packages are used in the analysis:

tidyverse: set of packages that work in harmony to make it easy to install and load multiple ‘tidyverse’ packages in a single step
ggplot2: ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics.
dplyr: dplyr provides a grammar of data manipulation, providing a consistent set of verbs that solve the most common data manipulation challenges.
psych: provides multivariate analysis and scale construction using factor analysis, principal component analysis, cluster analysis and reliability analysis; other functions provide basic descriptive statistics
DAAG: provides data sets and functions for data analysis and graphics
highcharter: provides various types of charts, from scatter plots to heatmaps and treemaps
knitr: enables integration of R code into LaTeX, LyX, HTML, Markdown, AsciiDoc and reStructuredText documents
kableExtra: allows users to construct complex tables and customize styles using a readable syntax.
DT: provides an R interface to the JavaScript library DataTables. R data objects (matrices or data frames) can be displayed as tables on HTML pages, and DataTables provides filtering, pagination, sorting, and many other features in the tables.
tm: used for text mining
corrplot: used to plot a correlation matrix, which helps identify collinearity between variables
leaps: It performs an exhaustive search for the best subsets of the variables in x for predicting y in linear regression, using an efficient branch-and-bound algorithm.

library(tidyverse)
library(ggplot2)
library(dplyr)
library(psych)
library(DAAG)
library(highcharter)
library(knitr)
library(kableExtra)
library(DT)
library(tm)
library(corrplot)
library(leaps)

Data Preparation

Data Source

The data set used in this project can be found here: Spotify Data

Summary of variables

This data comes from Spotify via the spotifyr package. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get either data or general metadata around songs from Spotify’s API.
The data set contains 32,833 observations of 23 variables.

The following table summarizes all the variables in the data set.

variable_name	description
track_id	unique ID
track_name	Song Name
track_artist	Song Artist
track_popularity	Song Popularity (0-100) where higher is better
track_album_id	Album unique ID
track_album_name	Song album name
track_album_release_date	Date when album released
playlist_name	Name of playlist
playlist_id	Playlist ID
playlist_genre	Playlist genre
playlist_subgenre	Playlist subgenre
danceability	Danceability describes how suitable a track is for dancing based on a combination of musical elements. A value of 0.0 is least danceable and 1.0 is most danceable.
energy	Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity.
key	The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation.
loudness	The overall loudness of a track in decibels (dB).
mode	Mode indicates the modality (major or minor) of a track
speechiness	Speechiness detects the presence of spoken words in a track.
acousticness	A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
instrumentalness	Predicts whether a track contains no vocals.
liveness	Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence	A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive.
tempo	The overall estimated tempo of a track in beats per minute (BPM).
duration_ms	Duration of song in milliseconds

Reading the data from .csv file

spotify_data <- read.csv("E:/Vinni_USA/MSIS coursework docs/Spring-20/4th Flex/Data Wrangling/Final Project/spotify/spotify_songs.csv", header = TRUE)
head(spotify_data)

##                 track_id                                            track_name
## 1 6f807x0ima9a1j3VPbc7VN I Don't Care (with Justin Bieber) - Loud Luxury Remix
## 2 0r7CVbZTWZgbTCYdfa2P31                       Memories - Dillon Francis Remix
## 3 1z1Hg7Vb0AhHDiEmnDE79l                       All the Time - Don Diablo Remix
## 4 75FpbthrwQmzHlBJLuGdC7                     Call You Mine - Keanu Silva Remix
## 5 1e8PAfcKUYoKkxPhrHqw4x               Someone You Loved - Future Humans Remix
## 6 7fvUMiyapMsRRxr07cU8Ef     Beautiful People (feat. Khalid) - Jack Wins Remix
##       track_artist track_popularity         track_album_id
## 1       Ed Sheeran               66 2oCs0DGTsRO98Gh5ZSl2Cx
## 2         Maroon 5               67 63rPSO264uRjW1X5E6cWv6
## 3     Zara Larsson               70 1HoSmj2eLcsrR0vE9gThr4
## 4 The Chainsmokers               60 1nqYsOef1yKKuGOVchbsk6
## 5    Lewis Capaldi               69 7m7vv9wlQ4i0LFuJiE2zsQ
## 6       Ed Sheeran               67 2yiy9cd2QktrNvWC2EUi0k
##                                        track_album_name
## 1 I Don't Care (with Justin Bieber) [Loud Luxury Remix]
## 2                       Memories (Dillon Francis Remix)
## 3                       All the Time (Don Diablo Remix)
## 4                           Call You Mine - The Remixes
## 5               Someone You Loved (Future Humans Remix)
## 6     Beautiful People (feat. Khalid) [Jack Wins Remix]
##   track_album_release_date playlist_name            playlist_id playlist_genre
## 1               2019-06-14     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 2               2019-12-13     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 3               2019-07-05     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 4               2019-07-19     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 5               2019-03-05     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 6               2019-07-11     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
##   playlist_subgenre danceability energy key loudness mode speechiness
## 1         dance pop        0.748  0.916   6   -2.634    1      0.0583
## 2         dance pop        0.726  0.815  11   -4.969    1      0.0373
## 3         dance pop        0.675  0.931   1   -3.432    0      0.0742
## 4         dance pop        0.718  0.930   7   -3.778    1      0.1020
## 5         dance pop        0.650  0.833   1   -4.672    1      0.0359
## 6         dance pop        0.675  0.919   8   -5.385    1      0.1270
##   acousticness instrumentalness liveness valence   tempo duration_ms
## 1       0.1020         0.00e+00   0.0653   0.518 122.036      194754
## 2       0.0724         4.21e-03   0.3570   0.693  99.972      162600
## 3       0.0794         2.33e-05   0.1100   0.613 124.008      176616
## 4       0.0287         9.43e-06   0.2040   0.277 121.956      169093
## 5       0.0803         0.00e+00   0.0833   0.725 123.976      189052
## 6       0.0799         0.00e+00   0.1430   0.585 124.982      163049
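The str() output in the next section shows text columns stored as factors, which is the default behavior of read.csv() in R versions before 4.0.0. As a minimal sketch (the temporary file and its three columns are made up for illustration and are not the project data), text columns can be kept as character vectors by setting stringsAsFactors explicitly:

```r
# Write a tiny made-up CSV to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("track_id,track_name,track_popularity",
    "id1,Song A,66",
    "id2,Song B,67"),
  csv_path
)

# stringsAsFactors = FALSE keeps text columns as character vectors,
# regardless of the R version's default.
toy <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)
str(toy)
```

Whether factors or characters are preferable depends on the later analysis; playlist_genre, for example, is naturally a factor, while track_name is not.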

Summary of data

str(spotify_data)

## 'data.frame':    32833 obs. of  23 variables:
##  $ track_id                : Factor w/ 28356 levels "0017A6SJgTbfQVU2EtsPNo",..: 22912 2531 7160 25706 4705 26672 9521 22445 26146 5283 ...
##  $ track_name              : Factor w/ 23449 levels "'39 - 2011 Mix",..: 9368 12887 944 3111 18360 1968 13859 15785 20934 9823 ...
##  $ track_artist            : Factor w/ 10692 levels "'Til Tuesday",..: 2848 6185 10633 9373 5530 2848 5000 8320 761 8562 ...
##  $ track_popularity        : int  66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : Factor w/ 22545 levels "000f3dTtvpazVzv35NuZmn",..: 7684 17645 4144 4691 21907 8636 21592 17795 21050 13719 ...
##  $ track_album_name        : Factor w/ 19743 levels "'74 - '75 (feat. Susan Tyler)",..: 7928 10684 981 2869 15185 1882 11515 13093 17788 8155 ...
##  $ track_album_release_date: Factor w/ 4530 levels "1957-01-01","1957-03",..: 4316 4493 4336 4349 4221 4341 4356 4389 4316 4321 ...
##  $ playlist_name           : Factor w/ 449 levels "\"Permanent Wave\"",..: 309 309 309 309 309 309 309 309 309 309 ...
##  $ playlist_id             : Factor w/ 471 levels "0275i1VNfBnsNbPl0QIBpG",..: 237 237 237 237 237 237 237 237 237 237 ...
##  $ playlist_genre          : Factor w/ 6 levels "edm","latin",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ playlist_subgenre       : Factor w/ 24 levels "album rock","big room",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ danceability            : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : int  6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : int  1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num  122 100 124 122 124 ...
##  $ duration_ms             : int  194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...

summary(spotify_data)

##                    track_id        track_name              track_artist  
##  7BKLCZ1jbUBVqRi2FVlTVw:   10   Poison  :   22   Martin Garrix   :  161  
##  14sOS5L36385FJ3OL8hew4:    9   Breathe :   21   Queen           :  136  
##  3eekarcy7kvN4yt5ZFzltW:    9   Alive   :   20   The Chainsmokers:  123  
##  0nbXyq5TXYPCO7pr3N8S4I:    8   Forever :   20   David Guetta    :  110  
##  0qaWEvPkts34WF68r8Dzx9:    8   Paradise:   19   Don Omar        :  102  
##  0rIAC4PXANcKmitJfoqmVm:    8   (Other) :32726   (Other)         :32196  
##  (Other)               :32781   NA's    :    5   NA's            :    5  
##  track_popularity                track_album_id 
##  Min.   :  0.00   5L1xcowSxwzFUSJzvyMp48:   42  
##  1st Qu.: 24.00   5fstCqs5NpIlF42VhPNv23:   29  
##  Median : 45.00   7CjJb2mikwAWA1V6kewFBF:   28  
##  Mean   : 42.48   4VFG1DOuTeDMBjBLZT7hCK:   26  
##  3rd Qu.: 62.00   2HTbQ0RHwukKVXAlTmCZP2:   21  
##  Max.   :100.00   4CzT5ueFBRpbILw34HQYxi:   21  
##                   (Other)               :32666  
##                     track_album_name track_album_release_date
##  Greatest Hits              :  139   2020-01-10:  270        
##  Ultimate Freestyle Mega Mix:   42   2019-11-22:  244        
##  Gold                       :   35   2019-12-06:  235        
##  Malibu                     :   30   2019-12-13:  220        
##  Rock & Rios (Remastered)   :   29   2013-01-01:  219        
##  (Other)                    :32553   2019-11-15:  215        
##  NA's                       :    5   (Other)   :31430        
##                                                      playlist_name  
##  Indie Poptimism                                            :  308  
##  2020 Hits & 2019  Hits – Top Global Tracks 🔥🔥🔥             :  247  
##  Permanent Wave                                             :  244  
##  Hard Rock Workout                                          :  219  
##  Ultimate Indie Presents... Best Indie Tracks of the 2010s  :  198  
##  Fitness Workout Electro | House | Dance | Progressive House:  195  
##  (Other)                                                    :31422  
##                  playlist_id    playlist_genre
##  4JkkvMpVl4lSioqQjeAL0q:  247   edm  :6043    
##  37i9dQZF1DWTHM4kX49UKs:  198   latin:5155    
##  6KnQDwp0syvhfHOR4lWP7x:  195   pop  :5507    
##  3xMQTDLOIGvj3lWH5e5x6F:  189   r&b  :5431    
##  3Ho3iO0iJykgEQNbjB2sic:  182   rap  :5746    
##  25ButZrVb1Zj1MJioMs09D:  109   rock :4951    
##  (Other)               :31713                 
##                  playlist_subgenre  danceability        energy        
##  progressive electro house: 1809   Min.   :0.0000   Min.   :0.000175  
##  southern hip hop         : 1675   1st Qu.:0.5630   1st Qu.:0.581000  
##  indie poptimism          : 1672   Median :0.6720   Median :0.721000  
##  latin hip hop            : 1656   Mean   :0.6548   Mean   :0.698619  
##  neo soul                 : 1637   3rd Qu.:0.7610   3rd Qu.:0.840000  
##  pop edm                  : 1517   Max.   :0.9830   Max.   :1.000000  
##  (Other)                  :22867                                      
##       key            loudness            mode         speechiness    
##  Min.   : 0.000   Min.   :-46.448   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 2.000   1st Qu.: -8.171   1st Qu.:0.0000   1st Qu.:0.0410  
##  Median : 6.000   Median : -6.166   Median :1.0000   Median :0.0625  
##  Mean   : 5.374   Mean   : -6.720   Mean   :0.5657   Mean   :0.1071  
##  3rd Qu.: 9.000   3rd Qu.: -4.645   3rd Qu.:1.0000   3rd Qu.:0.1320  
##  Max.   :11.000   Max.   :  1.275   Max.   :1.0000   Max.   :0.9180  
##                                                                      
##   acousticness    instrumentalness       liveness         valence      
##  Min.   :0.0000   Min.   :0.0000000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0151   1st Qu.:0.0000000   1st Qu.:0.0927   1st Qu.:0.3310  
##  Median :0.0804   Median :0.0000161   Median :0.1270   Median :0.5120  
##  Mean   :0.1753   Mean   :0.0847472   Mean   :0.1902   Mean   :0.5106  
##  3rd Qu.:0.2550   3rd Qu.:0.0048300   3rd Qu.:0.2480   3rd Qu.:0.6930  
##  Max.   :0.9940   Max.   :0.9940000   Max.   :0.9960   Max.   :0.9910  
##                                                                        
##      tempo         duration_ms    
##  Min.   :  0.00   Min.   :  4000  
##  1st Qu.: 99.96   1st Qu.:187819  
##  Median :121.98   Median :216000  
##  Mean   :120.88   Mean   :225800  
##  3rd Qu.:133.92   3rd Qu.:253585  
##  Max.   :239.44   Max.   :517810  
##

Cleaning the data set

Duplicate Data

We observe that many songs appear more than once in this dataset. They have the same ‘track_id’ but a different ‘playlist_id’, so we need to remove the duplicated songs. Since a song’s ‘track_id’ is unique and its other quantifiable variables remain the same across playlists, we delete the duplicated songs based on ‘track_id’.

spotify_data_unique = spotify_data[!duplicated(spotify_data$track_id),]
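The effect of this step can be checked on a small example. The sketch below uses a made-up toy data frame (not the project data) and compares the base R approach used above with dplyr's distinct(), which should give the same rows:

```r
library(dplyr)

# Toy data (made-up values): track "a" appears in two playlists.
toy <- data.frame(
  track_id    = c("a", "a", "b", "c"),
  playlist_id = c("p1", "p2", "p1", "p3"),
  energy      = c(0.9, 0.9, 0.5, 0.7),
  stringsAsFactors = FALSE
)

# Base R, as in the step above: keep the first row for each track_id.
toy_unique <- toy[!duplicated(toy$track_id), ]

# dplyr alternative: one row per track_id, keeping all other columns.
toy_distinct <- distinct(toy, track_id, .keep_all = TRUE)

nrow(toy_unique)                    # number of rows left after de-duplication
anyDuplicated(toy_unique$track_id)  # 0 means no duplicated track_id remains
```

Both approaches keep the first occurrence of each track, so the playlist-level columns (such as the genre) of a de-duplicated song come from the first playlist in which it appears.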

Redundant Columns

Since there are no more repeated songs in the list, and we want to analyze which variables influence ‘track_popularity’, we can drop the following columns, which are not useful in our analysis:

track_id
track_album_id
track_album_name
playlist_name
playlist_id
playlist_subgenre

spotify_data_2 <- spotify_data_unique[c(-1, -5, -6, -8, -9, -11)]
ls(spotify_data_2)

##  [1] "acousticness"             "danceability"            
##  [3] "duration_ms"              "energy"                  
##  [5] "instrumentalness"         "key"                     
##  [7] "liveness"                 "loudness"                
##  [9] "mode"                     "playlist_genre"          
## [11] "speechiness"              "tempo"                   
## [13] "track_album_release_date" "track_artist"            
## [15] "track_name"               "track_popularity"        
## [17] "valence"
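Dropping columns by position depends on the column order of the input file. A possible alternative, sketched here on a made-up toy data frame with only a few of the columns, drops columns by name using dplyr's select() with all_of():

```r
library(dplyr)

# Toy data frame (made-up values) with a subset of the real column names.
toy <- data.frame(
  track_id       = c("a", "b"),
  track_name     = c("Song A", "Song B"),
  track_album_id = c("x", "y"),
  playlist_id    = c("p1", "p2"),
  danceability   = c(0.7, 0.6),
  stringsAsFactors = FALSE
)

# Columns to drop, listed by name rather than by position.
drop_cols <- c("track_id", "track_album_id", "playlist_id")

toy_2 <- select(toy, -all_of(drop_cols))
names(toy_2)
```

all_of() raises an error if one of the listed columns is missing, which makes a renamed or reordered input file easier to notice.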

Splitting the track_album_release_date

spotify_data_3 <- spotify_data_2 %>%
  separate(track_album_release_date,
           c("track_album_release_year","track_album_release_month","track_album_release_day"),
           sep = "-")

spotify_data_4 <- spotify_data_3[c(-5, -6)]
head(spotify_data_4)

##                                              track_name     track_artist
## 1 I Don't Care (with Justin Bieber) - Loud Luxury Remix       Ed Sheeran
## 2                       Memories - Dillon Francis Remix         Maroon 5
## 3                       All the Time - Don Diablo Remix     Zara Larsson
## 4                     Call You Mine - Keanu Silva Remix The Chainsmokers
## 5               Someone You Loved - Future Humans Remix    Lewis Capaldi
## 6     Beautiful People (feat. Khalid) - Jack Wins Remix       Ed Sheeran
##   track_popularity track_album_release_year playlist_genre danceability energy
## 1               66                     2019            pop        0.748  0.916
## 2               67                     2019            pop        0.726  0.815
## 3               70                     2019            pop        0.675  0.931
## 4               60                     2019            pop        0.718  0.930
## 5               69                     2019            pop        0.650  0.833
## 6               67                     2019            pop        0.675  0.919
##   key loudness mode speechiness acousticness instrumentalness liveness valence
## 1   6   -2.634    1      0.0583       0.1020         0.00e+00   0.0653   0.518
## 2  11   -4.969    1      0.0373       0.0724         4.21e-03   0.3570   0.693
## 3   1   -3.432    0      0.0742       0.0794         2.33e-05   0.1100   0.613
## 4   7   -3.778    1      0.1020       0.0287         9.43e-06   0.2040   0.277
## 5   1   -4.672    1      0.0359       0.0803         0.00e+00   0.0833   0.725
## 6   8   -5.385    1      0.1270       0.0799         0.00e+00   0.1430   0.585
##     tempo duration_ms
## 1 122.036      194754
## 2  99.972      162600
## 3 124.008      176616
## 4 121.956      169093
## 5 123.976      189052
## 6 124.982      163049

For easier analysis, we have split ‘track_album_release_date’ into three different columns namely:

track_album_release_year
track_album_release_month
track_album_release_day

We will be focussing only on the year of release for analysis. So we will be deleting the ‘track_album_release_month’ & ‘track_album_release_day’ from our dataset.
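Some release dates in the data are partial (the factor levels shown earlier include "1957-03"), and for such values separate() fills the missing pieces with NA and emits a warning. Since only the year is kept, a simpler route is to take the first four characters of the date. The sketch below uses made-up example values; the year-only value "2012" is an assumption about what the data may contain:

```r
# Example date strings: a full date, a year-month, and a year only (assumed).
dates <- c("2019-06-14", "1957-03", "2012")

# as.character() guards against the column being a factor;
# the first four characters are the year in all three formats.
release_year <- as.integer(substr(as.character(dates), 1, 4))
release_year
```

This also yields an integer year rather than a character column, which is convenient for plotting or modeling.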

Missing Values

Now that our data does not contain any duplicate or redundant data, we check for missing values in the data set. We use the colSums function in R to count the missing values in each column.

colSums(is.na(spotify_data_4))

##               track_name             track_artist         track_popularity 
##                        4                        4                        0 
## track_album_release_year           playlist_genre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0

We observe that there are 4 missing values in track_name and track_artist columns. We can keep these observations, since missing values for track_name and track_artist wouldn’t impact our analysis.
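Before deciding to keep such rows, it can help to look at them directly. A minimal sketch on a made-up toy data frame (not the project data), using complete.cases() to pull out the incomplete observations:

```r
# Toy data (made-up values): the second row has no name or artist.
toy <- data.frame(
  track_name       = c("Song A", NA, "Song C"),
  track_artist     = c("Artist A", NA, "Artist C"),
  track_popularity = c(50L, 0L, 70L),
  stringsAsFactors = FALSE
)

# Missing values per column, as in the step above.
na_counts <- colSums(is.na(toy))
na_counts

# Rows with at least one missing value.
incomplete <- toy[!complete.cases(toy), ]
incomplete
```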

Boxplots and Outlier analysis

songs_data <- names(spotify_data_4)[c(6:9,11:17)]

songs <- spotify_data_4 %>%
  select(all_of(c('playlist_genre', songs_data))) %>%
  pivot_longer(cols = all_of(songs_data))

ggplot(data = songs) +
  geom_boxplot(aes(y = value)) + 
  facet_wrap(~name, nrow = 3, scales = "free") +
  coord_flip() +
  ggtitle("Outlier analysis", subtitle = "For different song attributes") +
  theme(axis.text.x = element_text(angle = 50, hjust = 1),axis.text.y = element_blank())

We observe that there are many outliers in the dataset for different variables. Removing these outliers would change the data and affect our analysis. Until we have a proper justification for removing them, we keep these outliers.
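By default, geom_boxplot() draws as outliers the points that lie more than 1.5 times the interquartile range beyond the quartiles. The sketch below approximately reproduces that rule numerically on made-up toy values, so the number of flagged points per variable can be counted; count_outliers is a hypothetical helper, not part of any package:

```r
# Hypothetical helper: count values outside the 1.5 * IQR fences.
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[[2]] - q[[1]]
  lower <- q[[1]] - 1.5 * iqr
  upper <- q[[2]] + 1.5 * iqr
  sum(x < lower | x > upper, na.rm = TRUE)
}

# Toy data (made-up values): one very quiet track in loudness.
toy <- data.frame(
  loudness = c(-6, -5, -7, -6.5, -5.5, -40),
  tempo    = c(100, 120, 110, 115, 125, 118)
)

outlier_counts <- sapply(toy, count_outliers)
outlier_counts
```

Counts like these give a rough sense of how much data would be lost if the flagged points were removed.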

Histograms

feature_names <- names(spotify_data_4)[c(3,6:17)]

songs <- spotify_data_4 %>%
  select(all_of(feature_names)) %>%
  pivot_longer(cols = all_of(feature_names))

songs %>%
  ggplot(aes(x = value)) +
  geom_histogram() +
  facet_wrap(~name, ncol = 5, scales = 'free') +
  labs(title = 'Audio Feature Pattern Frequency Plots', x = '', y = '') +
  theme(axis.text.y = element_blank())

We plot histograms to summarize the distribution of the variables in the data set. We observe:

Duration and valence are approximately normally distributed
Danceability, energy and loudness are left-skewed
Acousticness, liveness and speechiness are right-skewed
Key follows a comb distribution, since the bars are alternately tall and short. This pattern typically indicates rounded or discrete data; here, key takes only integer values from 0 to 11.
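The direction of skew read from the histograms can also be checked numerically. The sketch below defines a simple moment-based skewness (a hypothetical helper written for illustration, not a function from the packages loaded above) and applies it to made-up toy vectors: positive values indicate right skew, negative values left skew, and values near zero a roughly symmetric distribution.

```r
# Hypothetical helper: moment-based sample skewness.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Toy vectors (made-up values).
right_skewed <- c(0.01, 0.02, 0.03, 0.05, 0.9)  # long tail to the right
left_skewed  <- 1 - right_skewed                # mirror image: long tail to the left
symmetric    <- c(-2, -1, 0, 1, 2)

skew_right <- skewness(right_skewed)
skew_left  <- skewness(left_skewed)
skew_sym   <- skewness(symmetric)
c(right = skew_right, left = skew_left, symmetric = skew_sym)
```

Applied to the real columns, for example with sapply() over the numeric features, this would give one skewness value per variable to compare against the visual impression.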

Cleaned data set

Displaying 100 rows of the cleaned data set.

output_data <- head(spotify_data_4, n = 100)

datatable(output_data, filter = 'top', options = list(pageLength = 25))

Exploratory Data Analysis

Initial EDA

The dimensions of our final data set

dim(spotify_data_4)

## [1] 28356    17

There are 28,356 observations of 17 variables in the cleaned data set.

A glimpse into the data set to identify data types of all the variables.

str(spotify_data_4)

## 'data.frame':    28356 obs. of  17 variables:
##  $ track_name              : Factor w/ 23449 levels "'39 - 2011 Mix",..: 9368 12887 944 3111 18360 1968 13859 15785 20934 9823 ...
##  $ track_artist            : Factor w/ 10692 levels "'Til Tuesday",..: 2848 6185 10633 9373 5530 2848 5000 8320 761 8562 ...
##  $ track_popularity        : int  66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_release_year: chr  "2019" "2019" "2019" "2019" ...
##  $ playlist_genre          : Factor w/ 6 levels "edm","latin",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ danceability            : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : int  6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : int  1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num  122 100 124 122 124 ...
##  $ duration_ms             : int  194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...

Summary statistics for the cleaned data set.

summary(spotify_data_4)

##     track_name                       track_artist   track_popularity
##  Breathe :   18   Queen                    :  130   Min.   :  0.00  
##  Paradise:   17   Martin Garrix            :   87   1st Qu.: 21.00  
##  Poison  :   16   Don Omar                 :   84   Median : 42.00  
##  Alive   :   15   David Guetta             :   81   Mean   : 39.33  
##  Forever :   14   Dimitri Vegas & Like Mike:   68   3rd Qu.: 58.00  
##  (Other) :28272   (Other)                  :27902   Max.   :100.00  
##  NA's    :    4   NA's                     :    4                   
##  track_album_release_year playlist_genre  danceability        energy        
##  Length:28356             edm  :4877     Min.   :0.0000   Min.   :0.000175  
##  Class :character         latin:4137     1st Qu.:0.5610   1st Qu.:0.579000  
##  Mode  :character         pop  :5132     Median :0.6700   Median :0.722000  
##                           r&b  :4504     Mean   :0.6534   Mean   :0.698388  
##                           rap  :5401     3rd Qu.:0.7600   3rd Qu.:0.843000  
##                           rock :4305     Max.   :0.9830   Max.   :1.000000  
##                                                                             
##       key            loudness            mode         speechiness    
##  Min.   : 0.000   Min.   :-46.448   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 2.000   1st Qu.: -8.309   1st Qu.:0.0000   1st Qu.:0.0410  
##  Median : 6.000   Median : -6.261   Median :1.0000   Median :0.0626  
##  Mean   : 5.368   Mean   : -6.818   Mean   :0.5655   Mean   :0.1080  
##  3rd Qu.: 9.000   3rd Qu.: -4.709   3rd Qu.:1.0000   3rd Qu.:0.1330  
##  Max.   :11.000   Max.   :  1.275   Max.   :1.0000   Max.   :0.9180  
##                                                                      
##   acousticness     instrumentalness       liveness         valence      
##  Min.   :0.00000   Min.   :0.0000000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.01438   1st Qu.:0.0000000   1st Qu.:0.0926   1st Qu.:0.3290  
##  Median :0.07970   Median :0.0000206   Median :0.1270   Median :0.5120  
##  Mean   :0.17718   Mean   :0.0911168   Mean   :0.1910   Mean   :0.5104  
##  3rd Qu.:0.26000   3rd Qu.:0.0065700   3rd Qu.:0.2490   3rd Qu.:0.6950  
##  Max.   :0.99400   Max.   :0.9940000   Max.   :0.9960   Max.   :0.9910  
##                                                                         
##      tempo         duration_ms    
##  Min.   :  0.00   Min.   :  4000  
##  1st Qu.: 99.97   1st Qu.:187742  
##  Median :121.99   Median :216933  
##  Mean   :120.96   Mean   :226576  
##  3rd Qu.:134.00   3rd Qu.:254975  
##  Max.   :239.44   Max.   :517810  
##

Genre Characteristics

songs_data <- names(spotify_data_4)[c(6:9,11:17)]

songs <- spotify_data_4 %>%
  select(all_of(c('playlist_genre', songs_data))) %>%
  pivot_longer(cols = all_of(songs_data))

songs %>%
  ggplot(aes(x = value)) +
  geom_density(aes(color = playlist_genre)) +
  facet_wrap(~name, ncol = 4, scales = 'free') +
  labs(title = 'Songs characteristics',x = '', y = '') +
  theme(axis.text.x = element_text(angle = 50, hjust = 1),axis.text.y = element_blank())

From the plot, we observe that songs of different genres follow different patterns in their characteristics:

EDM tracks have high energy and high tempo
Latin tracks have high valence and danceability
Rock songs are most likely to be recorded live and have low danceability
R&B, rap and rock songs are more likely to have shorter durations than pop, Latin and EDM songs

Genre Classification

Based on the density plots, energy, valence and danceability appear to provide the most separation between genres during classification, while instrumentalness and key may not help much.

So, it is the combination of all these characteristics that contributes to classifying a song into its genre.
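One way to put numbers behind the visual impression from the density plots is to compare per-genre summary statistics. The following is a minimal sketch on made-up toy values (not the project data), computing the median of two features for each genre with dplyr:

```r
library(dplyr)

# Toy data (made-up values) for two genres and two features.
toy <- data.frame(
  playlist_genre = c("edm", "edm", "latin", "latin"),
  energy         = c(0.9, 0.8, 0.6, 0.7),
  valence        = c(0.3, 0.4, 0.8, 0.7),
  stringsAsFactors = FALSE
)

# Median of each feature within each genre.
genre_medians <- toy %>%
  group_by(playlist_genre) %>%
  summarise(
    median_energy  = median(energy),
    median_valence = median(valence)
  )

genre_medians
```

Features whose medians differ strongly between genres are candidates for separating them; features with similar medians across genres are less likely to help.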

Correlation within song characteristics

corr_plot_data <- spotify_data_4 %>% 
          select(track_popularity, danceability, energy, key, loudness, mode, speechiness,acousticness,
                 instrumentalness, liveness, valence, tempo, duration_ms)

corrplot(cor(corr_plot_data), 
         method = "color",  
         type = "upper", 
         order = "hclust")

We can see that there is:

a high positive correlation between energy and loudness
a positive correlation between danceability and valence
a strong negative correlation between acousticness and energy
a high negative correlation between acousticness and loudness

We can use the correlation matrix to determine whether there is any correlation between track popularity and song characteristics.

We observe a positive correlation between track popularity and acousticness, loudness, danceability and valence, and a negative correlation between track popularity and liveness, energy and instrumentalness.

We also observe that mode, speechiness, tempo and key have no strong correlation with track popularity.

Thus, we can conclude that popularity is associated with the following characteristics:

acousticness
loudness
valence
danceability
liveness
energy
instrumentalness
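Reading individual cells off a correlation plot can be imprecise, so it may help to extract the popularity row of the correlation matrix and sort it. The sketch below does this on a made-up toy data frame (the values are invented so that one feature correlates positively, one negatively, and one not at all):

```r
# Toy data (made-up values).
toy <- data.frame(
  track_popularity = c(10, 20, 30, 40, 50),
  loudness         = c(-10, -9, -7, -6, -4),
  instrumentalness = c(0.9, 0.5, 0.4, 0.1, 0.0),
  key              = c(1, 5, 3, 5, 1)
)

# Correlation of every variable with track_popularity.
cors <- cor(toy)["track_popularity", ]

# Drop the trivial self-correlation and sort from most positive to most negative.
cors <- cors[names(cors) != "track_popularity"]
sort(cors, decreasing = TRUE)
```

The same two steps applied to corr_plot_data would list the features in order of their correlation with popularity.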

Analysis of Song Popularity

Characteristics of popular songs

feature_names <- names(spotify_data_4)[c(6,7,9,11:16)]

songs <- spotify_data_4 %>% 
  arrange(desc(track_popularity)) %>%
  head(n = 500) %>%
  pivot_longer(cols = all_of(feature_names)) 

songs %>%
  ggplot(aes(x = name, y = value)) +
  geom_jitter(aes(color = playlist_genre)) +
  facet_wrap(~name, ncol = 3, scales = 'free') +
  labs(title = 'Audio Feature Pattern Frequency Plots', x = '', y = '') +
  theme(axis.text.y = element_blank())

From the jitter plot, we observe that the most popular songs on Spotify are:

low in acousticness
low in speechiness
low in liveness
highly danceable
highly energetic
loud

Most popular genre on Spotify

Number of songs in every genre

spotify_data_4 %>%
  filter(!is.na(track_artist)) %>%
  count(playlist_genre) %>%
  ggplot() +
  geom_col(aes(x = playlist_genre, y = n, fill = playlist_genre)) +
  coord_polar() +
  theme(axis.text.x = element_text(hjust = 1), axis.text.y = element_text(hjust = 1)) + 
  ggtitle("Number of songs in every genre") + 
  xlab("Song Genre") + 
  ylab("Number of songs")

This graph shows the number of songs in each genre in our dataset. We observe that the dataset has the most songs in the following genres:

rap
pop
EDM

This count will help us understand whether the popularity of songs depends on their genre.

Most popular songs in every genre

We are now identifying the most popular songs per genre in our dataset.

spotify_data_4 %>%
  select(playlist_genre, track_popularity, track_name) %>%
  group_by(playlist_genre) %>%
  # arrange() and head() ignore the grouping, so this keeps the 500 most popular tracks overall
  arrange(desc(track_popularity)) %>%
  head(n = 500) %>%
  ggplot(mapping = aes(x = playlist_genre, y =  track_popularity, 
                       color = playlist_genre, shape = playlist_genre,  
                       fill = playlist_genre
                       , label = track_name
                       )) +
  geom_point() +
  theme_minimal() +
  labs(x = 'genre', y = 'song popularity', title = 'Most popular Songs per genre') +
  geom_text(check_overlap = TRUE, data = subset(spotify_data_4, track_popularity > 97) ) +
  theme(plot.title = element_text(hjust = 0.5),legend.position = 'bottom')

From the scatter plot, we observe that pop songs are more popular than the remaining genres, followed by latin and rap.

Pop also has more songs than any other genre in the most popular songs list.
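Note that `head()` takes the first rows of a data frame regardless of any grouping, so the pipeline above keeps the 500 most popular tracks overall rather than a fixed number per genre. If a per-genre ranking is wanted instead, a minimal sketch with `dplyr::slice_max()` could look like the following. The `toy` data frame and its values are hypothetical stand-ins for `spotify_data_4`, used only to make the example self-contained.

```r
library(dplyr)

# Toy stand-in for spotify_data_4 (hypothetical values, not from the dataset)
toy <- tibble(
  playlist_genre   = c("pop", "pop", "pop", "rap", "rap", "rap"),
  track_name       = c("a", "b", "c", "d", "e", "f"),
  track_popularity = c(90, 70, 95, 60, 85, 40)
)

# head() ignores grouping: this keeps the 2 most popular rows overall
top_overall <- toy %>%
  group_by(playlist_genre) %>%
  arrange(desc(track_popularity)) %>%
  head(n = 2)

# slice_max() respects grouping: this keeps the 2 most popular rows per genre
top_per_genre <- toy %>%
  group_by(playlist_genre) %>%
  slice_max(track_popularity, n = 2, with_ties = FALSE) %>%
  ungroup()
```

In the toy data, `top_overall` contains only pop tracks, while `top_per_genre` contains two tracks from each genre.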

Top 10 songs on Spotify

We now identify the 10 most popular songs in the dataset.

This will help people who are new to music find the top trending songs. The table also lists the artist and genre of each song.

 top_songs <-
  spotify_data_4 %>%
  select(track_name, track_artist,playlist_genre, track_popularity) %>%
  group_by(playlist_genre) %>%
  arrange(desc(track_popularity)) %>%
  head(n = 10)

  top_songs %>% 
    ggplot(mapping = aes(x = track_name, y =  track_popularity, color = track_name)) +
    geom_point() +
    coord_polar() +
    theme_minimal() +
    labs(x = 'track_name', y = 'track_popularity', title = 'Top 10 songs in Spotify') +
    theme(plot.title = element_text(hjust = 0.5),legend.position = 'bottom')

top_songs %>%
  kable() %>%
  kable_styling()

track_name	track_artist	playlist_genre	track_popularity
Dance Monkey	Tones and I	pop	100
ROXANNE	Arizona Zervas	latin	99
Tusa	KAROL G	pop	98
Memories	Maroon 5	pop	98
Blinding Lights	The Weeknd	pop	98
Circles	Post Malone	pop	98
The Box	Roddy Ricch	rap	98
everything i wanted	Billie Eilish	pop	97
Don’t Start Now	Dua Lipa	pop	97
Falling	Trevor Daniel	pop	97

Analyzing song characteristics of top 10 songs

As we saw earlier, track popularity is influenced by acousticness, loudness, valence and danceability.

Of all the popular songs on Spotify, which ones make it to the top 10? To find out, we analyse these characteristics for the top 10 songs on Spotify.

Acousticness

top_songs <-
  spotify_data_4 %>%
  select(track_name, playlist_genre, track_popularity, acousticness, loudness, valence, danceability) %>%
  group_by(playlist_genre) %>%
  arrange(desc(track_popularity)) %>%
  head(n = 10)

ggplot(data = top_songs, aes(y = acousticness , x = track_name, fill = playlist_genre , 
                             shape = playlist_genre)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 25, hjust = 0.5)) +
  ggtitle('Acousticness', subtitle = 'For top 10 songs on Spotify')

Acousticness values vary considerably across the top 10 songs.

Loudness

top_songs <-
  spotify_data_4 %>%
  select(track_name, playlist_genre, track_popularity, acousticness, loudness, valence, danceability) %>%
  group_by(playlist_genre) %>%
  arrange(desc(track_popularity)) %>%
  head(n = 10)

ggplot(data = top_songs, aes(y = loudness , x = track_name, fill = playlist_genre, shape = playlist_genre)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 25, hjust = 0.5)) +
  ggtitle('Loudness', subtitle = 'For top 10 songs on Spotify')

Loudness levels are similar across tracks, except for ‘everything i wanted’.

Valence

top_songs <-
  spotify_data_4 %>%
  select(track_name, playlist_genre, track_popularity, acousticness, loudness, valence, danceability) %>%
  group_by(playlist_genre) %>%
  arrange(desc(track_popularity)) %>%
  head(n = 10)

ggplot(data = top_songs, aes(y = valence , x = track_name, fill = playlist_genre, shape = playlist_genre)) +
  geom_col() +  
  theme(axis.text.x = element_text(angle = 25, hjust = 0.5)) +
  ggtitle('Valence', subtitle = 'For top 10 songs on Spotify')

Valence levels are similar across tracks, except for ‘Falling’ and ‘everything i wanted’.

Danceability

top_songs <-
  spotify_data_4 %>%
  select(track_name, playlist_genre, track_popularity, acousticness, loudness, valence, danceability) %>%
  group_by(playlist_genre) %>%
  arrange(desc(track_popularity)) %>%
  head(n = 10)

ggplot(data = top_songs, aes(y = danceability , x = track_name, fill = playlist_genre, shape = playlist_genre)) +
  geom_col() + 
  theme(axis.text.x = element_text(angle = 25, hjust = 0.5)) +
  ggtitle('Danceability', subtitle = 'For top 10 songs on Spotify')

Danceability is high for every track in the top 10.

To summarize,

The songs with the highest popularity, i.e. the top 10 songs on Spotify, have high danceability, valence and loudness. We cannot observe any pattern for acousticness.

This makes sense, since danceable and loud songs are popular at parties and clubs.

Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). People generally prefer upbeat and happy songs making them more popular.

These observations are in line with the insights we got from the jitter plot of the most popular songs.

Top 10 artists on Spotify

We are identifying the top 10 artists in the list. This will help people identify other songs of these artists and listen to them in the future.

top_artist <-
  spotify_data_4 %>%
  select(track_name, track_artist,playlist_genre, track_popularity) %>%
  filter(!is.na(track_artist)) %>%
  arrange(desc(track_popularity)) %>%
  top_n(100) %>%
  count(track_artist) %>%
  arrange(-n) %>%
  head(10)

## Selecting by track_popularity

top_artist %>% 
  ggplot(aes(reorder(track_artist, n), n)) + 
  geom_col(fill = "cyan3") + 
  coord_flip() +
  labs(x = 'Artist', y = 'song count', title = 'Top 10 Artists') +
  theme(plot.title = element_text(hjust = 0.5),legend.position = 'bottom')

According to the graph, Post Malone has the largest number of popular songs in our dataset.

Multiple Regression Analysis

To predict popularity from song characteristics, we use multiple regression.

Regression analysis is a set of statistical processes for estimating the relationships between a dependent variable and one or more independent variables.

Multiple linear regression predicts the value of a dependent variable, track_popularity in our case, from several independent variables, here the song characteristics.
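For reference, the standard form of a multiple linear regression model is written out below. The symbols are notation introduced here for illustration: y_i is the track_popularity of song i, x_{ij} is the value of its j-th characteristic, p is the number of covariates, and ε_i is the error term.

```latex
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i,
\qquad i = 1, \dots, n
```

The usual assumptions are that the errors ε_i are independent, have mean zero and constant variance, and are approximately normally distributed. These are the assumptions examined in the model adequacy check.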

Model creation

Initial Model

We fit a multiple linear regression model with track_popularity as the response variable and danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo and duration_ms as the covariates.

model_1 <- lm(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, 
              data = spotify_data_4)

summary(model_1)

## 
## Call:
## lm(formula = track_popularity ~ danceability + energy + key + 
##     loudness + mode + speechiness + acousticness + instrumentalness + 
##     liveness + valence + tempo + duration_ms, data = spotify_data_4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -54.62 -17.22   2.95  18.10  60.54 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       6.783e+01  1.704e+00  39.805  < 2e-16 ***
## danceability      3.721e+00  1.072e+00   3.470 0.000522 ***
## energy           -2.321e+01  1.220e+00 -19.028  < 2e-16 ***
## key               3.190e-03  3.844e-02   0.083 0.933870    
## loudness          1.156e+00  6.527e-02  17.711  < 2e-16 ***
## mode              8.616e-01  2.809e-01   3.067 0.002161 ** 
## speechiness      -6.328e+00  1.380e+00  -4.587 4.52e-06 ***
## acousticness      4.331e+00  7.466e-01   5.801 6.67e-09 ***
## instrumentalness -9.292e+00  6.255e-01 -14.856  < 2e-16 ***
## liveness         -4.280e+00  8.990e-01  -4.761 1.93e-06 ***
## valence           1.788e+00  6.565e-01   2.724 0.006458 ** 
## tempo             2.609e-02  5.239e-03   4.979 6.42e-07 ***
## duration_ms      -4.342e-05  2.294e-06 -18.925  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.01 on 28343 degrees of freedom
## Multiple R-squared:  0.05808,    Adjusted R-squared:  0.05768 
## F-statistic: 145.6 on 12 and 28343 DF,  p-value: < 2.2e-16

All the covariates in the model except key are significant at the 0.05 level: each has a p-value below 0.05, whereas key has a p-value of 0.934.

However, the adjusted R-squared is 0.05768, which is low: the model explains only about 5.8% of the variation in track popularity. The p-value of the overall F-test is < 2.2e-16, which indicates that at least one covariate is related to popularity.

Next, we perform variable selection to confirm which covariates should be kept.

Variable selection

model_3 = regsubsets(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, 
             data = spotify_data_4,
             nbest = 7)

plot(model_3, scale = "bic")

According to best subset selection, ‘energy’ has a stronger influence than ‘loudness’.

Comparing the regression summary with the subset selection results, we conclude that 1 1 0 1 1 1 1 1 1 1 1 1 (every covariate except the third, ‘key’) is the best linear regression model for this dataset. In other words, all variables except ‘key’ are statistically significant in predicting track popularity.
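The best subset can also be read off numerically instead of from the BIC plot. The sketch below uses the built-in `mtcars` data as a stand-in, since `spotify_data_4` is not reproduced here, and assumes the `leaps` package is installed. One detail worth knowing is that `regsubsets()` only searches subsets of up to `nvmax` variables, and `nvmax` defaults to 8, so it has to be raised when larger models should be considered.

```r
library(leaps)

# mtcars is a built-in stand-in; spotify_data_4 is not reproduced here
n_predictors <- ncol(mtcars) - 1  # 10 candidate covariates for mpg

# nvmax is raised so that subsets of every size are searched
fit <- regsubsets(mpg ~ ., data = mtcars, nvmax = n_predictors)
fit_summary <- summary(fit)

# Row of the subset with the lowest BIC (lower is better)
best <- which.min(fit_summary$bic)

# Logical vector marking which terms the best subset includes
selected <- fit_summary$which[best, ]
print(names(selected)[selected])
```

Applied to the Spotify model, the same pattern would use the formula from `model_1` with `nvmax = 12`.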

Final Model

model_2 <- lm(track_popularity ~ danceability + energy + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, 
             data = spotify_data_4)

summary(model_2)

## 
## Call:
## lm(formula = track_popularity ~ danceability + energy + loudness + 
##     mode + speechiness + acousticness + instrumentalness + liveness + 
##     valence + tempo + duration_ms, data = spotify_data_4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.624 -17.226   2.949  18.099  60.533 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       6.785e+01  1.692e+00  40.113  < 2e-16 ***
## danceability      3.720e+00  1.072e+00   3.469 0.000523 ***
## energy           -2.321e+01  1.219e+00 -19.030  < 2e-16 ***
## loudness          1.156e+00  6.527e-02  17.712  < 2e-16 ***
## mode              8.576e-01  2.765e-01   3.101 0.001930 ** 
## speechiness      -6.326e+00  1.379e+00  -4.586 4.54e-06 ***
## acousticness      4.331e+00  7.465e-01   5.803 6.60e-09 ***
## instrumentalness -9.292e+00  6.254e-01 -14.856  < 2e-16 ***
## liveness         -4.280e+00  8.990e-01  -4.761 1.93e-06 ***
## valence           1.789e+00  6.564e-01   2.726 0.006414 ** 
## tempo             2.608e-02  5.239e-03   4.979 6.44e-07 ***
## duration_ms      -4.342e-05  2.294e-06 -18.928  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.01 on 28344 degrees of freedom
## Multiple R-squared:  0.05808,    Adjusted R-squared:  0.05771 
## F-statistic: 158.9 on 11 and 28344 DF,  p-value: < 2.2e-16

The adjusted R-squared is 0.05771. The model therefore explains only about 5.8% of the variation in track popularity, so its predictive power is limited.
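As a cross-check, the adjusted R-squared can be recomputed from the multiple R-squared and the degrees of freedom printed in the two summaries. The helper `adjusted_r2()` below is a hypothetical name introduced for this sketch, and the inputs are the rounded values from the output, so the results agree with the summaries only up to rounding.

```r
# Adjusted R-squared from R-squared, sample size n and number of covariates p
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# n is recovered from the residual degrees of freedom: n = df + p + 1
n <- 28344 + 11 + 1

adj_initial <- adjusted_r2(0.05808, n = n, p = 12)  # initial model, 12 covariates
adj_final   <- adjusted_r2(0.05808, n = n, p = 11)  # final model, 11 covariates

print(round(c(initial = adj_initial, final = adj_final), 5))
```

Dropping ‘key’ leaves the multiple R-squared unchanged at the reported precision, so the adjusted value rises slightly because the penalty for the number of covariates is smaller.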

Model Adequacy Check

par(mfrow = c(1,2))
# generate QQ plot
qqnorm(model_2$residuals,main = "Model")
qqline(model_2$residuals)

# generate Scatter Plot
plot(model_2$fitted.values,model_2$residuals,pch = 20)
abline(h = 0,col = "grey")

From the graphs, we observe that the points in the QQ plot deviate from the reference line and that the residuals in the scatter plot are not evenly spread around zero.

Therefore, the model does not fully satisfy the normality, linearity and equal variance assumptions.

Prediction Analysis

Now we use the model created to make predictions about the track popularity.

new_popularity <- data.frame(danceability = 0.718,
                             energy = 0.93,
                             loudness = -3.778,
                             mode = 1,
                             speechiness = 0.102,
                             acousticness = 0.0287,
                             instrumentalness = 0,
                             liveness = 0.204,
                             valence = 0.277,
                             tempo = 121.956,
                             duration_ms = 169093)

print(paste0("Observed popularity: ",60))

## [1] "Observed popularity: 60"

predicted <- predict(model_2, newdata = new_popularity)
print(paste0("Predicted popularity: ",(predicted)))

## [1] "Predicted popularity: 40.3716218419876"
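To see where this number comes from, the prediction can be reproduced by hand: multiply each coefficient by the corresponding feature value and add the intercept. The sketch below copies the rounded coefficients from the summary of model_2, so the result matches the output of predict() only approximately.

```r
# Coefficients copied (rounded) from summary(model_2)
coefs <- c(
  intercept        = 67.85,
  danceability     = 3.720,
  energy           = -23.21,
  loudness         = 1.156,
  mode             = 0.8576,
  speechiness      = -6.326,
  acousticness     = 4.331,
  instrumentalness = -9.292,
  liveness         = -4.280,
  valence          = 1.789,
  tempo            = 0.02608,
  duration_ms      = -4.342e-05
)

# Feature values from new_popularity, with 1 for the intercept term
x <- c(
  intercept        = 1,
  danceability     = 0.718,
  energy           = 0.93,
  loudness         = -3.778,
  mode             = 1,
  speechiness      = 0.102,
  acousticness     = 0.0287,
  instrumentalness = 0,
  liveness         = 0.204,
  valence          = 0.277,
  tempo            = 121.956,
  duration_ms      = 169093
)

# Linear predictor: intercept plus the sum of coefficient * feature value
manual_prediction <- sum(coefs * x[names(coefs)])
print(manual_prediction)
```

The largest negative contributions come from energy (about -21.6) and duration_ms (about -7.3), which is why a loud, energetic track still receives a modest predicted popularity.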

The predicted popularity (about 40.4) is lower than the observed value (60). The gap is consistent with the model's low adjusted R-squared and with the skewness in the data.

The model needs to be transformed to make more accurate predictions about popularity.

Summary

Conclusion

The popularity of a song is most influenced by its danceability, loudness and valence. We came to this conclusion from the correlation matrix and the jitter plot of the most popular songs on Spotify.
The factors that determine a song’s genre are danceability, energy and valence. We came to this conclusion from the density plot of the song characteristics.
Spotify could be determining a song’s popularity based on all the characteristics apart from ‘key’. We concluded this from the multiple linear regression model obtained through variable selection.
The pop genre has the highest number of popular songs on Spotify. We concluded this from the bar graph we plotted to classify the top 100 songs according to their genre.

Insights

A common assumption is that energy influences popularity, i.e. that energetic songs are more popular. However, we could not find any correlation between popularity and energy.
The songs in the top 100 were not evenly distributed across genres. We observe that people prefer pop music over other genres.

Implications

The model we created could be used to estimate a song's popularity. Such an estimate would help people understand how a song is likely to fare when it is released.
This analysis can be helpful to students studying music or wanting to pursue a career in music.

Future Scope

We can improve the model by applying transformations to the dependent variable and the covariates, which should give a better model for prediction.
We can include sub-genre as a factor that determines the popularity of a song.
Combining other music-related datasets with the Spotify data will help in better analysing a song’s popularity.