Spotify data analysis

Introduction

In this project, I analyse some of the audio features of the songs available on Spotify in order to gain insight into the factors that may contribute, directly or indirectly, to the popularity of a song. The data was obtained from https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-21/readme.md and includes track metadata such as artist and genre details, as well as information about the musicality of each song, such as valence, loudness and energy.

Problem statement:

The problem I address in this project is to find and analyse the features that contribute to the popularity of a song. Many different kinds of songs are released every day. What makes one song more likeable than another? Why is one genre more popular than another? These are some of the questions I plan on addressing through this project.

Proposed approach:

At the initial stage of the project, I plan on performing exploratory data analysis, mainly through visualizations, to answer those questions.

This analysis might help artists come up with songs tailored to the expectations of their audience, and also help them get better reach for their songs.

Packages Required

These are the packages that I have used for this analysis so far:

  • tibble: Used to store data as a tibble, which makes the data easier to handle and manipulate
  • DT: Used to display the data on the screen in a scrollable format
  • knitr: Used to display an aligned table on the screen
  • tm: Used for text mining on the "genre" columns in the data
  • dplyr: Used for data manipulation
  • ggplot2: Used to plot charts
  • wordcloud: Used to draw a word cloud in the genre text analysis
  • fitdistrplus: Used for statistical analysis (distribution fitting)
  • plotly: Used to plot interactive charts
  • corrplot: Used to plot the correlation matrix
  • rpart and rpart.plot: Used to fit and plot the decision tree
  • GGally and highcharter: Plotting packages that are loaded below but not yet used in the analysis shown here
library(tibble)
library(DT)
## Warning: package 'DT' was built under R version 3.6.2
library(knitr)
library(tm)
## Warning: package 'tm' was built under R version 3.6.2
## Loading required package: NLP
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 3.6.2
## Loading required package: RColorBrewer
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(GGally)
## Warning: package 'GGally' was built under R version 3.6.2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
library(highcharter)
## Warning: package 'highcharter' was built under R version 3.6.3
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use

Data preparation

The following steps were performed to prepare the data and carry out an initial inspection.

Importing the data

spotify <- read.csv("C:/Users/leksh/Documents/SS2020/R/Week4/spotify_songs.csv", stringsAsFactors = FALSE)
head(spotify)
##                 track_id   track_name   track_artist track_popularity
## 1 2XU0oxnq2qxCpomAAuJY8K Dance Monkey    Tones and I              100
## 2 2XU0oxnq2qxCpomAAuJY8K Dance Monkey    Tones and I              100
## 3 696DnlkuDOXcMAnKlTgXXK      ROXANNE Arizona Zervas               99
## 4 696DnlkuDOXcMAnKlTgXXK      ROXANNE Arizona Zervas               99
## 5 696DnlkuDOXcMAnKlTgXXK      ROXANNE Arizona Zervas               99
## 6 696DnlkuDOXcMAnKlTgXXK      ROXANNE Arizona Zervas               99
##           track_album_id                            track_album_name
## 1 0UywfDKYlyiu1b38DRrzYD Dance Monkey (Stripped Back) / Dance Monkey
## 2 0UywfDKYlyiu1b38DRrzYD Dance Monkey (Stripped Back) / Dance Monkey
## 3 6HJDrXs0hpebaRFKA1sF90                                     ROXANNE
## 4 6HJDrXs0hpebaRFKA1sF90                                     ROXANNE
## 5 6HJDrXs0hpebaRFKA1sF90                                     ROXANNE
## 6 6HJDrXs0hpebaRFKA1sF90                                     ROXANNE
##   track_album_release_date
## 1               10/17/2019
## 2               10/17/2019
## 3               10/10/2019
## 4               10/10/2019
## 5               10/10/2019
## 6               10/10/2019
##                                                                                                        playlist_name
## 1                                                                  post-teen alternative, indie, pop (large variety)
## 2                                                                                          Global Top 50 | 2020 Hits
## 3                                                                                          Global Top 50 | 2020 Hits
## 4                                                                                                 Contemporary Urban
## 5 Charts 2020 🔥Top 2020🔥Hits 2020🔥Summer 2020🔥Pop 2020🔥Popular Music🔥Clean Pop 2020🔥Sing Alongs
## 6 Charts 2020 🔥Top 2020🔥Hits 2020🔥Summer 2020🔥Pop 2020🔥Popular Music🔥Clean Pop 2020🔥Sing Alongs
##              playlist_id playlist_genre  playlist_subgenre danceability energy
## 1 1y42gwI5cuwjBslPyQNfqb            pop      post-teen pop        0.824  0.588
## 2 1KNl4AYfgZtOVm9KHkhPTF          latin      latin hip hop        0.824  0.588
## 3 1KNl4AYfgZtOVm9KHkhPTF          latin      latin hip hop        0.621  0.601
## 4 6wyJ4bsjZaUKa9f6GeZlAO            r&b urban contemporary        0.621  0.601
## 5 3xMQTDLOIGvj3lWH5e5x6F            r&b            hip pop        0.621  0.601
## 6 3xMQTDLOIGvj3lWH5e5x6F            edm            pop edm        0.621  0.601
##   key loudness mode speechiness acousticness instrumentalness liveness valence
## 1   6   -6.400    0      0.0924       0.6920         0.000104    0.149   0.513
## 2   6   -6.400    0      0.0924       0.6920         0.000104    0.149   0.513
## 3   6   -5.616    0      0.1480       0.0522         0.000000    0.460   0.457
## 4   6   -5.616    0      0.1480       0.0522         0.000000    0.460   0.457
## 5   6   -5.616    0      0.1480       0.0522         0.000000    0.460   0.457
## 6   6   -5.616    0      0.1480       0.0522         0.000000    0.460   0.457
##     tempo duration_ms
## 1  98.027      209438
## 2  98.027      209438
## 3 116.735      163636
## 4 116.735      163636
## 5 116.735      163636
## 6 116.735      163636

Inspecting the data

colnames(spotify)
##  [1] "track_id"                 "track_name"              
##  [3] "track_artist"             "track_popularity"        
##  [5] "track_album_id"           "track_album_name"        
##  [7] "track_album_release_date" "playlist_name"           
##  [9] "playlist_id"              "playlist_genre"          
## [11] "playlist_subgenre"        "danceability"            
## [13] "energy"                   "key"                     
## [15] "loudness"                 "mode"                    
## [17] "speechiness"              "acousticness"            
## [19] "instrumentalness"         "liveness"                
## [21] "valence"                  "tempo"                   
## [23] "duration_ms"
dim(spotify)
## [1] 32833    23

We observe that there are 32833 rows and 23 columns in the dataset.
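The head() output above shows the same track_id repeated once per playlist, so the rows are track-playlist pairs rather than unique songs. A minimal sketch of how this could be checked in base R is given here; the data frame `toy` and its values are made up for illustration, and only the column names come from the dataset.

```r
# Toy stand-in for `spotify` (hypothetical values): one row per
# (track, playlist) pair, so a track can appear several times.
toy <- data.frame(
  track_id         = c("a1", "a1", "b2", "b2", "b2", "c3"),
  track_popularity = c(100, 100, 99, 99, 99, 80),
  playlist_genre   = c("pop", "latin", "latin", "r&b", "edm", "rock"),
  stringsAsFactors = FALSE
)

# Number of rows versus number of distinct tracks.
n_rows   <- nrow(toy)
n_tracks <- length(unique(toy$track_id))

# Keep only the first row seen for each track_id.
toy_unique <- toy[!duplicated(toy$track_id), ]
```

Applied to `spotify`, the same two counts would show how many of the 32833 rows are repeats. Whether to deduplicate depends on the question: genre counts need every row, while a ranking of "top songs" is arguably better done on unique tracks.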

Structure of the dataset

str(spotify)
## 'data.frame':    32833 obs. of  23 variables:
##  $ track_id                : chr  "2XU0oxnq2qxCpomAAuJY8K" "2XU0oxnq2qxCpomAAuJY8K" "696DnlkuDOXcMAnKlTgXXK" "696DnlkuDOXcMAnKlTgXXK" ...
##  $ track_name              : chr  "Dance Monkey" "Dance Monkey" "ROXANNE" "ROXANNE" ...
##  $ track_artist            : chr  "Tones and I" "Tones and I" "Arizona Zervas" "Arizona Zervas" ...
##  $ track_popularity        : int  100 100 99 99 99 99 98 98 98 98 ...
##  $ track_album_id          : chr  "0UywfDKYlyiu1b38DRrzYD" "0UywfDKYlyiu1b38DRrzYD" "6HJDrXs0hpebaRFKA1sF90" "6HJDrXs0hpebaRFKA1sF90" ...
##  $ track_album_name        : chr  "Dance Monkey (Stripped Back) / Dance Monkey" "Dance Monkey (Stripped Back) / Dance Monkey" "ROXANNE" "ROXANNE" ...
##  $ track_album_release_date: chr  "10/17/2019" "10/17/2019" "10/10/2019" "10/10/2019" ...
##  $ playlist_name           : chr  "post-teen alternative, indie, pop (large variety)" "Global Top 50 | 2020 Hits" "Global Top 50 | 2020 Hits" "Contemporary Urban" ...
##  $ playlist_id             : chr  "1y42gwI5cuwjBslPyQNfqb" "1KNl4AYfgZtOVm9KHkhPTF" "1KNl4AYfgZtOVm9KHkhPTF" "6wyJ4bsjZaUKa9f6GeZlAO" ...
##  $ playlist_genre          : chr  "pop" "latin" "latin" "r&b" ...
##  $ playlist_subgenre       : chr  "post-teen pop" "latin hip hop" "latin hip hop" "urban contemporary" ...
##  $ danceability            : num  0.824 0.824 0.621 0.621 0.621 0.621 0.803 0.764 0.513 0.764 ...
##  $ energy                  : num  0.588 0.588 0.601 0.601 0.601 0.601 0.715 0.32 0.796 0.32 ...
##  $ key                     : int  6 6 6 6 6 6 2 11 1 11 ...
##  $ loudness                : num  -6.4 -6.4 -5.62 -5.62 -5.62 ...
##  $ mode                    : int  0 0 0 0 0 0 1 1 1 1 ...
##  $ speechiness             : num  0.0924 0.0924 0.148 0.148 0.148 0.148 0.298 0.0546 0.0629 0.0546 ...
##  $ acousticness            : num  0.692 0.692 0.0522 0.0522 0.0522 0.0522 0.295 0.837 0.00147 0.837 ...
##  $ instrumentalness        : num  0.000104 0.000104 0 0 0 0 0.000134 0 0.000209 0 ...
##  $ liveness                : num  0.149 0.149 0.46 0.46 0.46 0.46 0.0574 0.0822 0.0938 0.0822 ...
##  $ valence                 : num  0.513 0.513 0.457 0.457 0.457 0.457 0.574 0.575 0.345 0.575 ...
##  $ tempo                   : num  98 98 117 117 117 ...
##  $ duration_ms             : int  209438 209438 163636 163636 163636 163636 200960 189486 201573 189486 ...

Checking for missing values in the dataset

table(is.na(spotify))
## 
##  FALSE   TRUE 
## 755144     15

There are 15 missing values in the dataset.

Finding the columns with missing values

sapply(spotify, function(x) sum(is.na(x)))
##                 track_id               track_name             track_artist 
##                        0                        5                        5 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        5 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0

We find that the columns with missing values are the track, artist and album name columns (track_name, track_artist and track_album_name), with 5 missing values each.
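The per-column count can also be written with colSums(), which makes it easy to pull out just the affected column names. This is a small sketch on a made-up data frame (`toy` and its values are hypothetical), not a replacement for the check above.

```r
# Toy data frame with one missing track name (hypothetical values).
toy <- data.frame(
  track_id   = c("a1", "b2", "c3"),
  track_name = c("Song A", NA, "Song C"),
  energy     = c(0.5, 0.8, 0.6),
  stringsAsFactors = FALSE
)

# Missing values per column; is.na() returns a logical matrix
# and colSums() counts the TRUE entries in each column.
na_counts <- colSums(is.na(toy))

# Names of the columns that contain at least one missing value.
cols_with_na <- names(na_counts)[na_counts > 0]
```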

Let us see the observations which contain these values.

spotify_null <- spotify[!complete.cases(spotify),]
spotify_null 
##                     track_id track_name track_artist track_popularity
## 30558 69gRFGOWY9OMpFJgFol1u0       <NA>         <NA>                0
## 30759 5cjecvX0CmC9gK0Laf5EMQ       <NA>         <NA>                0
## 30760 5TTzhRSWQS4Yu8xTgAuq6D       <NA>         <NA>                0
## 31405 3VKFip3OdAvv4OfNTgFWeQ       <NA>         <NA>                0
## 31428 69gRFGOWY9OMpFJgFol1u0       <NA>         <NA>                0
##               track_album_id track_album_name track_album_release_date
## 30558 717UG2du6utFe7CdmpuUe3             <NA>                 1/5/2012
## 30759 3luHJEPw434tvNbme3SP8M             <NA>                12/1/2017
## 30760 3luHJEPw434tvNbme3SP8M             <NA>                12/1/2017
## 31405 717UG2du6utFe7CdmpuUe3             <NA>                 1/5/2012
## 31428 717UG2du6utFe7CdmpuUe3             <NA>                 1/5/2012
##               playlist_name            playlist_id playlist_genre
## 30558               HIP&HOP 5DyJsJZOpMJh34WvUrQzMV            rap
## 30759           GANGSTA Rap 5GA8GDo7RQC3JEanT81B3g            rap
## 30760           GANGSTA Rap 5GA8GDo7RQC3JEanT81B3g            rap
## 31405 Reggaeton viejito🔥 0si5tw70PIgPkY1Eva6V8f          latin
## 31428         latin hip hop 3nH8aytdqNeRbcRCg3dw9q          latin
##       playlist_subgenre danceability energy key loudness mode speechiness
## 30558  southern hip hop        0.714  0.821   6   -7.635    1      0.1760
## 30759      gangster rap        0.678  0.659  11   -5.364    0      0.3190
## 30760      gangster rap        0.465  0.820  10   -5.907    0      0.3070
## 31405         reggaeton        0.675  0.919  11   -6.075    0      0.0366
## 31428     latin hip hop        0.714  0.821   6   -7.635    1      0.1760
##       acousticness instrumentalness liveness valence   tempo duration_ms
## 30558       0.0410          0.00000   0.1160   0.649  95.999      282707
## 30759       0.0534          0.00000   0.5530   0.191 146.153      202235
## 30760       0.0963          0.00000   0.0888   0.505  86.839      206465
## 31405       0.0606          0.00653   0.1030   0.726  97.017      252773
## 31428       0.0410          0.00000   0.1160   0.649  95.999      282707

Since only the track, artist and album names are missing, and none of the audio features are affected, I leave these rows untreated and keep them as they are for the analysis.

Exploratory data analysis - Visualizations

I am mostly going to use visualizations to find insights into the data. Visualizations help us spot anomalies and general trends in the data, and make it much easier to arrive at insights.

Song genre - which is the most popular genre on Spotify

genre <- Corpus(VectorSource(spotify$playlist_genre))
genre_dtm <- DocumentTermMatrix(genre)
# Number of rows (track-playlist pairs) per genre term
genre_freq <- sort(colSums(as.matrix(genre_dtm)), decreasing = TRUE)
genre_wf <- data.frame(word = names(genre_freq), freq = genre_freq)

# fill is a fixed colour, so it is set outside aes() to avoid a spurious legend
ggplot(genre_wf, aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat = "identity", fill = "red") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Most Popular genres in Spotify") +
  xlab("Genre") +
  ylab("Frequency")
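Because playlist_genre already holds one label per row, the same frequencies can be obtained without text mining. The sketch below uses table() on a made-up vector (`genres` is hypothetical); it counts each label exactly as written, so the result does not depend on how a tokenizer treats a label such as "r&b".

```r
# Hypothetical genre labels, one per row of the data.
genres <- c("edm", "rap", "pop", "edm", "r&b", "rap", "edm")

# Count each label and sort from most to least frequent.
genre_counts <- sort(table(genres), decreasing = TRUE)

# Same shape as the word/frequency data frame used for the bar chart.
genre_wf <- data.frame(
  word = names(genre_counts),
  freq = as.vector(genre_counts),
  stringsAsFactors = FALSE
)
```

Note that these counts measure how many playlist rows fall in each genre, which is a measure of how common a genre is in the dataset rather than of track_popularity.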

spotify %>% select_if(is.numeric) %>% pairs()

Since the pairs plot is very difficult to read, we try a correlation matrix instead.

options(repr.plot.width = 20, repr.plot.height = 15)
spotify_sliced <- spotify[sapply(spotify, is.numeric)]
corr <- cor(spotify_sliced)
library(corrplot)
## corrplot 0.84 loaded
num <- corrplot(corr, method = "number")
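Since the question is about popularity, it can help to read just the track_popularity row of the correlation matrix, ordered by strength. The following is a sketch on a made-up data frame (`toy` and its values are hypothetical; the column names follow the dataset).

```r
# Toy numeric data (hypothetical values).
toy <- data.frame(
  track_popularity = c(10, 20, 30, 40, 50),
  energy           = c(0.2, 0.4, 0.5, 0.7, 0.9),
  acousticness     = c(0.9, 0.7, 0.6, 0.3, 0.1),
  tempo            = c(120, 90, 130, 100, 110)
)

# Pearson correlation matrix of all numeric columns.
corr <- cor(toy)

# Correlation of every other feature with track_popularity,
# ordered from strongest to weakest in absolute value.
pop_corr <- corr["track_popularity", colnames(corr) != "track_popularity"]
pop_corr <- pop_corr[order(abs(pop_corr), decreasing = TRUE)]
```

With the real data, `corr` would be the matrix computed from `spotify_sliced`. These are linear, pairwise correlations only, so a weak value does not rule out a non-linear or genre-dependent relationship.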

Relationship between acousticness and genre

boxplot(acousticness~playlist_genre, data=spotify,
main = "Variation of acousticness between genres",
xlab = "acousticness",
ylab = "Genre",
col = "magenta",
border = "red",
horizontal = TRUE,
notch = TRUE
)

boxplot(energy~playlist_genre, data=spotify,
main = "Variation of energy between genres",
xlab = "Energy",
ylab = "Genre",
col = "orange",
border = "brown",
horizontal = TRUE,
notch = TRUE
)

Songs in the edm genre tend to have the highest energy.

Analysis of genre by loudness

boxplot(loudness~playlist_genre, data=spotify,
main = "Variation of loudness between genres",
xlab = "Loudness",
ylab = "Genre",
col = "green",
border = "dark green",
horizontal = TRUE,
notch = TRUE
)

Songs in the edm genre are generally louder in nature.
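The boxplot readings can be backed with numbers by computing the median of each feature per genre. This is a minimal sketch using aggregate() on a made-up data frame (`toy` and its values are hypothetical).

```r
# Toy data (hypothetical values) with the dataset's column names.
toy <- data.frame(
  playlist_genre = c("edm", "edm", "edm", "rock", "rock", "r&b", "r&b"),
  energy         = c(0.90, 0.80, 0.85, 0.70, 0.75, 0.50, 0.60),
  loudness       = c(-4.0, -5.0, -4.5, -6.0, -7.0, -8.0, -9.0),
  stringsAsFactors = FALSE
)

# Median energy and loudness for each genre.
genre_medians <- aggregate(cbind(energy, loudness) ~ playlist_genre,
                           data = toy, FUN = median)

# Order genres from highest to lowest median energy.
genre_medians <- genre_medians[order(-genre_medians$energy), ]
```

Run on `spotify`, the first row of `genre_medians` would name the genre with the highest median energy, which can then be compared with what the boxplots suggest.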

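We also saw that there is some correlation between energy, valence and danceability. Let us see how these three are distributed among the 50 most popular rows. Since a track appears once for every playlist it belongs to, these rows cover fewer than 50 distinct songs.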

spotify_arrange <- spotify[order(-spotify$track_popularity),]
corr_deb <- ggplot(head(spotify_arrange,50)) +
    # alpha is a constant, so it is set outside aes(); 0.5 keeps overlapping curves visible
    geom_density(aes(energy, fill = "energy"), alpha = 0.5) +
    geom_density(aes(valence, fill = "valence"), alpha = 0.5) +
    geom_density(aes(danceability, fill = "danceability"), alpha = 0.5) +
    scale_x_continuous(name = "Energy, Valence and danceability") +
    scale_y_continuous(name = "Density") +
    ggtitle("Density plot of Energy, Valence and danceability") +
    theme_bw() +
    theme(plot.title = element_text(size = 14, face = "bold"),
          text = element_text(size = 12)) +
    theme(legend.title=element_blank()) +
    scale_fill_brewer(palette="Accent")

corr_deb

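We can see that the three distributions overlap heavily and are left skewed, meaning the most popular songs tend to score high on all three features. This is consistent with the positive correlation between these variables, although the overlap alone does not establish it.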

spotify_arrange <- spotify[order(-spotify$track_popularity),]
popularity_vote <- head(spotify_arrange,50) %>%
  select(energy, track_artist, speechiness, tempo, playlist_genre) %>%
  # drop rows with a missing value in any plotted column
  filter(!is.na(energy), !is.na(track_artist), !is.na(speechiness), !is.na(tempo)) %>%
  ggplot(mapping = aes(x = track_artist, y = energy, color = tempo, alpha = speechiness, fill = playlist_genre))+
  geom_bar(stat = 'identity')+
  coord_polar()+
  theme_minimal()

popularity_vote 

3-D Scatterplot

spotify_arrange <- spotify[order(-spotify$track_popularity),]
#library(plotly)
plot_ly(head(spotify_arrange, n=500), x = ~track_popularity, y = ~energy, z = ~valence, 
        color = ~playlist_genre, colors = c('magenta', 'green', 'red', 'cyan', 'orange', 'yellow'),size = I(100)) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'track_popularity'),
                      yaxis = list(title = 'Energy'),
                      zaxis = list(title = 'Valence')),
         title = "3D Scatter plot: Track_popularity vs Energy vs Valence",
         showlegend = FALSE)

The plot suggests that, among the 500 most popular rows, the popularity of a song is related to its energy.

Most important factors in determining popularity

We examined the factors above, the correlation between them, and how each one relates to popularity. Now, from a modeling perspective, I want to find the factors that most affect popularity and that differentiate one song from another, making it more or less popular.

library(rpart)
## Warning: package 'rpart' was built under R version 3.6.2
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.6.2
tree_model <- rpart(track_popularity ~ loudness+valence+energy+key+danceability+acousticness+playlist_genre, data = spotify_arrange)
rpart.plot(tree_model, box.palette = "GnBu")

We see that, of the factors in the model (loudness, valence, energy, key, danceability, acousticness and playlist_genre), genre plays a major role, as 35% of the top songs are edm. Loudness plays a role as well: songs with loudness > -7.4 have a good chance of being among the top songs.
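Besides reading the plotted splits, a fitted rpart object also stores a variable importance score for each predictor, which gives a numeric ranking of the factors. The sketch below illustrates this on R's built-in mtcars data rather than the Spotify data; `demo_tree` is a hypothetical name used only for this example.

```r
library(rpart)

# Regression tree on the built-in mtcars data (stand-in for the Spotify model).
demo_tree <- rpart(mpg ~ cyl + hp + wt + disp, data = mtcars)

# Named numeric vector of importance scores, largest first.
importance <- demo_tree$variable.importance

# Express each score as a percentage of the total.
importance_pct <- round(100 * importance / sum(importance), 1)
```

For the model above, `tree_model$variable.importance` would give the corresponding ranking. The scores also credit surrogate splits, so a variable can rank highly without appearing in the plotted tree.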

Summarised findings

These are the findings that I have observed from the analysis thus far on the Spotify dataset.

  • The top 3 genres on Spotify are edm, rap and pop.
  • Energy might have some effect on the popularity of a song, which needs to be explored further.
  • There are many artists who span different genres, such as Roddy Ricch and Karol G, among many others.
  • Genre plays a very important role in paving the way to a top spot for a song.
  • Loudness and energy appear to have a positive effect on the popularity of a song.