ANALYZING SONGS FROM SPOTIFY

1. Introduction

(1.1) Introduction

Spotify is an audio streaming provider that offers recorded music and podcasts, with a catalog of more than 70 million songs. The focus of this project is to analyze data about songs that are streaming on Spotify, uncover interesting trends, and identify the most important factors behind a song’s popularity.

(1.2) Problem Statement

  • To understand the dataset and what it contains to determine how it can be used for our analysis
  • To perform Data Cleaning if necessary so that the data is usable for our analysis
  • To perform Exploratory Data Analysis and understand the variables contributing to a song’s popularity

(1.3) Approach

  • Examine the properties of each variable in the dataset and clean the data where there are missing values or outliers
  • Plot the correlation between a song’s popularity and the other variables to see which variables correlate most strongly with it

(1.4) Consumer Impact

  • Determining the variables that most affect a song’s popularity can help creators compose new songs that reach a wider audience, much as Netflix has used user data to create better web series
  • It can help companies decide which types of songs or artists to invest in so that they can maximize revenue

2. Packages Required

(2.1 - 2.3) Packages Used

library(tidyr)        # Reshapes data into a tidy format that is easy to work with
library(dplyr)        # Data wrangling with a small set of functions that combine to extract and summarize tabular data
library(ggplot2)      # Creates the visualisations
library(DT)           # Displays tables in HTML
library(tidyverse)    # Loads the remaining tidyverse packages, including forcats (fct_reorder) and readr
library(corrplot)     # Plots the correlation between variables
library(RColorBrewer) # Provides the colour palette for the correlation plot

3. Data Preparation

(3.1) Data Source

Source: https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-21/readme.md

Importing the data into R:

spotify <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
knitr::kable(head(spotify), align = "lccrr") 
track_id | track_name | track_artist | track_popularity | track_album_id | track_album_name | track_album_release_date | playlist_name | playlist_id | playlist_genre | playlist_subgenre | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms
6f807x0ima9a1j3VPbc7VN | I Don’t Care (with Justin Bieber) - Loud Luxury Remix | Ed Sheeran | 66 | 2oCs0DGTsRO98Gh5ZSl2Cx | I Don’t Care (with Justin Bieber) [Loud Luxury Remix] | 2019-06-14 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.748 | 0.916 | 6 | -2.634 | 1 | 0.0583 | 0.1020 | 0.00e+00 | 0.0653 | 0.518 | 122.036 | 194754
0r7CVbZTWZgbTCYdfa2P31 | Memories - Dillon Francis Remix | Maroon 5 | 67 | 63rPSO264uRjW1X5E6cWv6 | Memories (Dillon Francis Remix) | 2019-12-13 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.726 | 0.815 | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 4.21e-03 | 0.3570 | 0.693 | 99.972 | 162600
1z1Hg7Vb0AhHDiEmnDE79l | All the Time - Don Diablo Remix | Zara Larsson | 70 | 1HoSmj2eLcsrR0vE9gThr4 | All the Time (Don Diablo Remix) | 2019-07-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.675 | 0.931 | 1 | -3.432 | 0 | 0.0742 | 0.0794 | 2.33e-05 | 0.1100 | 0.613 | 124.008 | 176616
75FpbthrwQmzHlBJLuGdC7 | Call You Mine - Keanu Silva Remix | The Chainsmokers | 60 | 1nqYsOef1yKKuGOVchbsk6 | Call You Mine - The Remixes | 2019-07-19 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.718 | 0.930 | 7 | -3.778 | 1 | 0.1020 | 0.0287 | 9.40e-06 | 0.2040 | 0.277 | 121.956 | 169093
1e8PAfcKUYoKkxPhrHqw4x | Someone You Loved - Future Humans Remix | Lewis Capaldi | 69 | 7m7vv9wlQ4i0LFuJiE2zsQ | Someone You Loved (Future Humans Remix) | 2019-03-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.650 | 0.833 | 1 | -4.672 | 1 | 0.0359 | 0.0803 | 0.00e+00 | 0.0833 | 0.725 | 123.976 | 189052
7fvUMiyapMsRRxr07cU8Ef | Beautiful People (feat. Khalid) - Jack Wins Remix | Ed Sheeran | 67 | 2yiy9cd2QktrNvWC2EUi0k | Beautiful People (feat. Khalid) [Jack Wins Remix] | 2019-07-11 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.675 | 0.919 | 8 | -5.385 | 1 | 0.1270 | 0.0799 | 0.00e+00 | 0.1430 | 0.585 | 124.982 | 163049

(3.2) About the Dataset

The dataset contains 32,833 rows and 23 columns and was collected in January 2020. Our target variable is track_popularity. The remaining columns describe each track: credits such as the track name, artist and album, and audio features such as danceability, energy and acousticness. There are 15 missing values in the data (5 each in track_name, track_artist and track_album_name); the rows containing them are removed during data cleaning.

d <- read.csv("spotify_dd.csv") 
DT::datatable(d, options = list(pageLength = 50, scrollX = TRUE, autoWidth = TRUE), class = 'cell-border stripe')

Data Cleaning

(3.3) Dropping identifier columns that are not needed for the analysis

ids <- c("track_id","track_album_id","playlist_id") 
spotify.data <- data.frame(spotify[,!(names(spotify) %in% ids)])

Identifying missing data in the dataset and removing the rows

colSums(is.na(spotify.data)) 
##               track_name             track_artist         track_popularity 
##                        5                        5                        0 
##         track_album_name track_album_release_date            playlist_name 
##                        5                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0
spotify.data <- na.omit(spotify.data) 
colSums(is.na(spotify.data)) 
##               track_name             track_artist         track_popularity 
##                        0                        0                        0 
##         track_album_name track_album_release_date            playlist_name 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0
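The 15 missing values counted above are cells, not rows, so the number of rows that na.omit() removes can be smaller. The sketch below shows one way to check both figures. It runs on a small toy data frame with hypothetical values; in the report the same calls would be applied to spotify.data before it is overwritten.

```r
# Toy data frame (hypothetical values) standing in for spotify.data
toy <- data.frame(
  track_name       = c("a", NA, "c", NA),
  track_artist     = c("x", NA, "z", NA),
  track_popularity = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

missing_cells   <- sum(is.na(toy))             # total number of NA cells
incomplete_rows <- sum(!complete.cases(toy))   # rows with at least one NA

cleaned      <- na.omit(toy)                   # drops every incomplete row
rows_dropped <- nrow(toy) - nrow(cleaned)

c(missing_cells = missing_cells,
  incomplete_rows = incomplete_rows,
  rows_dropped = rows_dropped)
```

In the toy data 4 cells are missing but only 2 rows are dropped, because the missing values sit in the same rows.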

Converting the categorical variables ‘key’ and ‘mode’ to factors

spotify.data$key <- as.factor(spotify.data$key)
spotify.data$mode <- as.factor(spotify.data$mode)
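The factor levels created above are bare integers, which are hard to read on plot axes. A possible refinement is to attach readable labels. The mapping below (0 = C up to 11 = B for key, 0 = minor and 1 = major for mode) is an assumption based on Spotify's audio-features documentation, not something stated in the dataset, and label_key and label_mode are hypothetical helper names.

```r
# Pitch-class names for the integer key codes 0..11.
# ASSUMPTION: mapping taken from Spotify's audio-features documentation.
pitch_classes <- c("C", "C#", "D", "D#", "E", "F",
                   "F#", "G", "G#", "A", "A#", "B")

# Hypothetical helpers that turn the integer codes into labelled factors
label_key <- function(key) {
  factor(key, levels = 0:11, labels = pitch_classes)
}
label_mode <- function(mode) {
  factor(mode, levels = c(0, 1), labels = c("minor", "major"))
}

# Key and mode values of the first three rows shown in the data preview
keys  <- label_key(c(6, 11, 1))
modes <- label_mode(c(1, 1, 0))
as.character(keys)
as.character(modes)
```

Under that assumed mapping, the first three tracks in the preview would be in F# major, B major and C# minor.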

(3.4) Final data after cleaning

knitr::kable(head(spotify.data,2), align = "lccrr") 
track_name | track_artist | track_popularity | track_album_name | track_album_release_date | playlist_name | playlist_genre | playlist_subgenre | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms
I Don’t Care (with Justin Bieber) - Loud Luxury Remix | Ed Sheeran | 66 | I Don’t Care (with Justin Bieber) [Loud Luxury Remix] | 2019-06-14 | Pop Remix | pop | dance pop | 0.748 | 0.916 | 6 | -2.634 | 1 | 0.0583 | 0.1020 | 0.00000 | 0.0653 | 0.518 | 122.036 | 194754
Memories - Dillon Francis Remix | Maroon 5 | 67 | Memories (Dillon Francis Remix) | 2019-12-13 | Pop Remix | pop | dance pop | 0.726 | 0.815 | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 0.00421 | 0.3570 | 0.693 | 99.972 | 162600

Checking for outliers and distribution for each numerical column:

Boxplots

num_cols<-c('danceability','energy','loudness','speechiness','acousticness','instrumentalness',
            'liveness','valence','track_popularity')

par(mfrow=c(3,3))
for (i in num_cols){
  boxplot(spotify.data[[i]], main=sprintf('Boxplot of %s',i))
}

Outliers: The box plots show a few outliers in several metrics, and only some of the metrics (danceability, energy and valence) look roughly normally distributed. We leave the outliers in place, since manipulating them may not add much value to the analysis.
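To put a number on what the box plots show, the points drawn as outliers can be counted per column. The sketch below uses boxplot.stats(), which applies the same 1.5 x IQR rule as boxplot(). The helper name count_outliers and the toy values are hypothetical; in the report the function would be applied to spotify.data[num_cols].

```r
# Count the values that a box plot would draw as outliers
# (more than 1.5 x IQR beyond the hinges).
count_outliers <- function(x) {
  length(boxplot.stats(x)$out)
}

# Toy columns (hypothetical values) standing in for spotify.data[num_cols]
toy <- data.frame(
  liveness = c(0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.95),
  valence  = c(0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80)
)

outlier_counts <- sapply(toy, count_outliers)
outlier_counts
```

In the toy data only the liveness value 0.95 is flagged. Dividing the counts by nrow() would give the share of flagged rows, which may be a more useful basis for deciding whether any treatment is needed.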

(3.5) Summary of Variables

d1 <- read.csv("summary.csv") 
DT::datatable(d1, options = list(pageLength = 50, scrollX = TRUE, autoWidth = TRUE), class = 'cell-border stripe')

4. Exploratory Data Analysis

The histograms below show the distribution of each numerical variable, with a density curve overlaid. Most of the variables are not normally distributed. The primary motivation is to understand how popularity and the other numerical variables are distributed before relating them to one another.

par(mfrow=c(3,3))
for (i in num_cols){
  hist(x = spotify.data[[i]], 
       col="blue",
       lty=1,
       freq = FALSE,
       main = sprintf('Histogram of  %s',i),
       xlab = i)
  lines(density(x = spotify.data[[i]],na.rm=TRUE), lwd=2, col='red')
}
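The visual impression from the histograms can be complemented with a simple numeric summary. The sketch below computes sample skewness, which is close to 0 for a symmetric distribution and positive for a long right tail. The skewness helper is written by hand here rather than taken from a package, and the two toy vectors are simulated, not Spotify data.

```r
# Sample skewness: third central moment divided by the variance to the power 1.5
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(1)
symmetric_toy    <- rnorm(1000)   # simulated, roughly normal
right_skewed_toy <- rexp(1000)    # simulated, long right tail

c(symmetric    = skewness(symmetric_toy),
  right_skewed = skewness(right_skewed_toy))
```

Applied to the report's data, something like sapply(spotify.data[num_cols], skewness) would give one value per histogram.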

The motivation behind isolating the numerical variables is to check for strong correlations between them and to see which variables are needed to gauge the popularity of a song.

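spotify_numeric <- c("track_popularity", "danceability", "energy", "key", "loudness", "mode", "speechiness", "acousticness", "instrumentalness", "liveness", "valence", "tempo", "duration_ms")
# The raw spotify tibble is used here because key and mode are still numeric in it
# (cor() needs numeric input, and both are factors in spotify.data)
M <- cor(spotify[, spotify_numeric])
corrplot(M, type = "lower", order = "hclust", method = "number",
         col = brewer.pal(n = 8, name = "RdYlBu"))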
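Since the question is which variables relate most to popularity, it can help to pull out just the track_popularity correlations and sort them by absolute size. The sketch below does this with a hypothetical helper, rank_correlations, on simulated toy data; the numbers it produces say nothing about the real dataset.

```r
# Rank numeric features by the absolute value of their Pearson
# correlation with a target column.
rank_correlations <- function(df, target) {
  features <- setdiff(names(df), target)
  r <- sapply(features, function(f) cor(df[[f]], df[[target]]))
  r[order(abs(r), decreasing = TRUE)]
}

# Simulated toy data (hypothetical): popularity falls with energy,
# valence is unrelated to it
set.seed(42)
n <- 200
toy <- data.frame(energy = runif(n), valence = runif(n))
toy$track_popularity <- 50 - 20 * toy$energy + rnorm(n, sd = 5)

ranked <- rank_correlations(toy, "track_popularity")
ranked
```

With the report's data, the call would be rank_correlations(spotify[, spotify_numeric], "track_popularity"). Sorting by absolute value keeps strong negative correlations at the top alongside strong positive ones.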

The bar chart shows the 15 artists with the highest number of songs in the dataset. The number of songs could indicate how popular an artist is compared to others. A larger number of songs could also increase an artist’s chance of having a successful song in their discography.

# Count tracks per artist and keep the 15 artists with the most tracks
popular_artist <- spotify.data %>%
  group_by(track_artist) %>%
  summarize(no_of_tracks = n()) %>%
  top_n(15, no_of_tracks) %>%
  arrange(desc(no_of_tracks))
ggplot(popular_artist, aes(x = fct_reorder(track_artist, no_of_tracks), y = no_of_tracks)) +
  geom_bar(stat = 'identity') +
  labs(y = "Number of songs", x = "Artist Name") +
  ggtitle("Artists with the highest number of tracks in the dataset") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

The box plot below shows how a song’s popularity varies with the genre of its playlist

boxplot(track_popularity~playlist_genre, data = spotify.data, main = "Popularity vs Genre")
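The medians drawn in the box plot can also be tabulated, which makes the genre comparison easier to quote. This is a minimal sketch with base R's aggregate() on a toy data frame with hypothetical values; in the report the data argument would be spotify.data.

```r
# Toy data (hypothetical values) standing in for spotify.data
toy <- data.frame(
  playlist_genre   = c("pop", "pop", "pop", "rock", "rock", "rock"),
  track_popularity = c(60, 70, 80, 30, 40, 50)
)

# Median popularity per genre, highest first
genre_summary <- aggregate(track_popularity ~ playlist_genre,
                           data = toy, FUN = median)
genre_summary <- genre_summary[order(-genre_summary$track_popularity), ]
genre_summary
```

The median is used rather than the mean so that the table matches the centre line of each box.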

Popularity doesn’t seem to vary drastically with the ‘Key’ of a track

boxplot(track_popularity~key, data = spotify.data, main = "Popularity vs Key")

Model:

model_lm <- lm(track_popularity~danceability+ energy + loudness + speechiness + acousticness + liveness + valence + duration_ms, data = spotify.data)

summary(model_lm)
## 
## Call:
## lm(formula = track_popularity ~ danceability + energy + loudness + 
##     speechiness + acousticness + liveness + valence + duration_ms, 
##     data = spotify.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -58.432 -17.749   2.716  18.938  71.620 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.621e+01  1.479e+00  58.296  < 2e-16 ***
## danceability  1.692e+00  1.026e+00   1.648  0.09932 .  
## energy       -3.431e+01  1.172e+00 -29.275  < 2e-16 ***
## loudness      1.835e+00  6.236e-02  29.427  < 2e-16 ***
## speechiness  -4.334e+00  1.350e+00  -3.210  0.00133 ** 
## acousticness  2.360e+00  7.337e-01   3.217  0.00130 ** 
## liveness     -4.203e+00  8.864e-01  -4.742 2.13e-06 ***
## valence       5.697e+00  6.270e-01   9.085  < 2e-16 ***
## duration_ms  -4.688e-05  2.288e-06 -20.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.21 on 32819 degrees of freedom
## Multiple R-squared:  0.06135,    Adjusted R-squared:  0.06112 
## F-statistic: 268.1 on 8 and 32819 DF,  p-value: < 2.2e-16

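As expected from the earlier correlation plot, none of the variables has a strong relationship with popularity, and the relationships do not appear to be linear. The R-squared for the model is only about 6%, so the model explains very little of the variation in popularity, even though the overall p-value is very small and most coefficients are statistically significant, which is common with more than 32,000 observations. The exception is danceability, whose coefficient is not significantly different from 0 at the 5% level (p = 0.099).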
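Two figures in the regression output are easier to read after a little arithmetic. The sketch below first rescales the duration_ms estimate from the output above into a per-minute effect, then recomputes R-squared by hand on simulated toy data to show what that figure measures. The toy data and variable names are hypothetical and unrelated to the Spotify results.

```r
# 1) Rescale the duration_ms coefficient to a per-minute effect
coef_per_ms     <- -4.688e-05      # Estimate for duration_ms in summary(model_lm)
ms_per_minute   <- 60 * 1000
coef_per_minute <- coef_per_ms * ms_per_minute
coef_per_minute                    # about -2.81 popularity points per extra minute

# 2) R-squared by hand: 1 - (residual sum of squares / total sum of squares)
set.seed(7)
toy   <- data.frame(x = runif(500))              # simulated predictor
toy$y <- 40 + 5 * toy$x + rnorm(500, sd = 10)    # weak signal, large noise
fit   <- lm(y ~ x, data = toy)

rss <- sum(residuals(fit)^2)
tss <- sum((toy$y - mean(toy$y))^2)
r2_manual <- 1 - rss / tss
c(manual = r2_manual, from_summary = summary(fit)$r.squared)
```

The two R-squared values agree, which confirms that the statistic is the share of the variance in the response that the fitted values account for.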

6. Summary

(6.1) Problems that have been addressed:

  • Cleaned the data provided by Spotify to analyze the songs.

  • Performed Exploratory Data Analysis to identify top trends for the songs in the dataset.

  • Checked how correlated a song’s popularity is with its numerical features such as danceability, acousticness etc.

(6.2) Methodology followed to address the problems:

  • Checked for outliers, dropped the variables that are not relevant to the problem, checked for NAs and removed the few rows that contained them to eliminate unwanted noise from the data

  • Plotted box plots and histograms to identify the distribution of each variable. Checked the number of songs by artist

  • Plotted the correlation between each pair of variables using a correlation matrix to discover variables that are highly correlated

(6.3) Insights:

  • The metrics Danceability, Energy and Valence are fairly normally distributed, unlike the other numerical variables

  • The key insight from this analysis is that, among all the variables, track_popularity has relatively higher correlation with duration, instrumentalness and energy. The regression coefficients for duration_ms and energy are negative, which suggests that shorter songs with more vocals and lower energy tend to be somewhat more popular than other songs

However, these correlations are weak, making it difficult to predict the popularity of any new songs.

  • The following 3 findings from the exploratory analysis complement the earlier key insight:

    - The artist Martin Garrix has the highest number of tracks

    - Pop songs seem to have relatively higher popularity than tracks from other genres

    - The popularity of songs does not vary much with the ‘Key’ of a track, which is counterintuitive considering the importance given to the ‘Key’ while composing a track.

  • The linear regression model does not explain the data well (R-squared of about 6%), so a more flexible model that can handle nonlinear relationships between the variables and popularity is needed.

(6.4) Implications to the customer:

  • The analysis suggests that composers and production houses may want to pay attention to properties like instrumentalness and energy while composing a track, as these show the strongest (though still weak) association with a track’s popularity

  • Also, ‘Key’ for a track may not be an important factor while composing a track

(6.5) Limitations:

  • Although a few key metrics associated with a track’s popularity have been identified, the model built on them explains only a small share of the variation in popularity. So, the predicted popularities may not be very accurate

  • This dataset comes from a single source, Spotify. Having data from other sources could help us build better models and predict popularity more accurately