ANALYZING SONGS FROM SPOTIFY

1. Introduction

(1.1) Introduction

Spotify is an audio streaming provider that offers recorded music and podcasts, with a catalog of more than 70 million songs. This project analyzes data about songs streaming on Spotify to uncover interesting trends and the most important factors behind a song’s popularity.

(1.2) Problem Statement

  • To understand the dataset and what it contains to determine how it can be used for our analysis
  • To perform Data Cleaning if necessary so that the data is usable for our analysis
  • To perform Exploratory Data Analysis and understand the variables contributing to a song’s popularity

(1.3) Approach

  • We will look at the properties of each variable in the dataset and clean the data if there are too many missing values or outliers
  • We will plot the popularity of a song against the other variables and check which of them correlate most strongly with it
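
The correlation step can be sketched as follows. This is a minimal illustration under stated assumptions rather than the project's final code: the helper name popularity_correlations and the toy data frame are hypothetical, and on the real data the input would be the cleaned Spotify table with track_popularity as the target.

```r
# Hypothetical helper: Pearson correlation of each numeric feature with the target.
popularity_correlations <- function(df, target = "track_popularity") {
  is_num   <- vapply(df, is.numeric, logical(1))
  features <- setdiff(names(df)[is_num], target)
  cors <- vapply(features,
                 function(f) cor(df[[f]], df[[target]], use = "complete.obs"),
                 numeric(1))
  # Strongest relationships (by absolute value) first
  cors[order(abs(cors), decreasing = TRUE)]
}

# Toy data for illustration only (not taken from the Spotify dataset)
toy <- data.frame(
  track_popularity = c(10, 20, 30, 40, 50),
  energy           = c(0.1, 0.2, 0.3, 0.4, 0.5),  # rises in step with popularity
  loudness         = c(-3, -9, -4, -8, -5),
  track_name       = c("a", "b", "c", "d", "e")   # non-numeric, ignored
)
popularity_correlations(toy)
```

Sorting by absolute value keeps strong negative relationships near the top of the list alongside strong positive ones.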

(1.4) Consumer Impact

  • Determining the most important variables affecting the popularity of songs can help creators compose new songs that reach a wider audience, much as Netflix used user data to create better web series
  • It can help companies decide which types of songs or artists to invest in so that they can maximize revenue

2. Packages Required

(2.1 - 2.3) Packages Used

library(tidyr)    # tidyr: tools for reshaping data into a tidy, easy-to-use form
library(dplyr)    # dplyr: a small set of composable functions for wrangling and summarizing tabular data
library(ggplot2)  # ggplot2: used to create visualisations
library(DT)       # DT: displays tables in HTML
# readr and knitr are called below as readr::read_csv() and knitr::kable(), so they must be installed as well

3. Data Preparation

(3.1) Data Source

Source: https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-21/readme.md

Importing the data into R:

spotify <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
knitr::kable(head(spotify), align = "lccrr") 

| track_id | track_name | track_artist | track_popularity | track_album_id | track_album_name | track_album_release_date | playlist_name | playlist_id | playlist_genre | playlist_subgenre | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms |
| 6f807x0ima9a1j3VPbc7VN | I Don’t Care (with Justin Bieber) - Loud Luxury Remix | Ed Sheeran | 66 | 2oCs0DGTsRO98Gh5ZSl2Cx | I Don’t Care (with Justin Bieber) [Loud Luxury Remix] | 2019-06-14 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.748 | 0.916 | 6 | -2.634 | 1 | 0.0583 | 0.1020 | 0.00e+00 | 0.0653 | 0.518 | 122.036 | 194754 |
| 0r7CVbZTWZgbTCYdfa2P31 | Memories - Dillon Francis Remix | Maroon 5 | 67 | 63rPSO264uRjW1X5E6cWv6 | Memories (Dillon Francis Remix) | 2019-12-13 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.726 | 0.815 | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 4.21e-03 | 0.3570 | 0.693 | 99.972 | 162600 |
| 1z1Hg7Vb0AhHDiEmnDE79l | All the Time - Don Diablo Remix | Zara Larsson | 70 | 1HoSmj2eLcsrR0vE9gThr4 | All the Time (Don Diablo Remix) | 2019-07-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.675 | 0.931 | 1 | -3.432 | 0 | 0.0742 | 0.0794 | 2.33e-05 | 0.1100 | 0.613 | 124.008 | 176616 |
| 75FpbthrwQmzHlBJLuGdC7 | Call You Mine - Keanu Silva Remix | The Chainsmokers | 60 | 1nqYsOef1yKKuGOVchbsk6 | Call You Mine - The Remixes | 2019-07-19 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.718 | 0.930 | 7 | -3.778 | 1 | 0.1020 | 0.0287 | 9.40e-06 | 0.2040 | 0.277 | 121.956 | 169093 |
| 1e8PAfcKUYoKkxPhrHqw4x | Someone You Loved - Future Humans Remix | Lewis Capaldi | 69 | 7m7vv9wlQ4i0LFuJiE2zsQ | Someone You Loved (Future Humans Remix) | 2019-03-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.650 | 0.833 | 1 | -4.672 | 1 | 0.0359 | 0.0803 | 0.00e+00 | 0.0833 | 0.725 | 123.976 | 189052 |
| 7fvUMiyapMsRRxr07cU8Ef | Beautiful People (feat. Khalid) - Jack Wins Remix | Ed Sheeran | 67 | 2yiy9cd2QktrNvWC2EUi0k | Beautiful People (feat. Khalid) [Jack Wins Remix] | 2019-07-11 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.675 | 0.919 | 8 | -5.385 | 1 | 0.1270 | 0.0799 | 0.00e+00 | 0.1430 | 0.585 | 124.982 | 163049 |

(3.2) About the Dataset

The dataset contains 32,833 rows and 23 columns and was collected in January 2020. Our target variable is track_popularity; the remaining columns describe each song through its credits and through audio features such as danceability, energy and acousticness. There are 15 missing values in the data (5 each in track_name, track_artist and track_album_name); the affected rows are removed during data cleaning.

d <- read.csv("spotify_dd.csv")
DT::datatable(d, options = list(pageLength = 50, scrollX = TRUE, autoWidth = TRUE), class = 'cell-border stripe')
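
The row, column and missing-value counts quoted in this section can be checked once the data is loaded. The sketch below is only an illustration: describe_data is a hypothetical helper, and the toy data frame stands in for the spotify data frame read in section 3.1.

```r
# Hypothetical helper: basic size and missingness summary of a data frame.
describe_data <- function(df) {
  list(
    n_rows          = nrow(df),
    n_cols          = ncol(df),
    n_missing       = sum(is.na(df)),           # total number of missing cells
    n_rows_affected = sum(!complete.cases(df))  # rows with at least one NA
  )
}

# Toy data for illustration only (not taken from the Spotify dataset)
toy <- data.frame(
  track_name       = c("a", NA, "c"),
  track_artist     = c("x", NA, "z"),
  track_popularity = c(10, 20, 30)
)
str(describe_data(toy))
```

Reporting both counts is useful because several missing cells can sit in the same row, so the number of rows dropped by na.omit() can be smaller than the number of missing values.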

Data Cleaning

(3.3) Dropping identifier columns that are not useful for analysis

ids <- c("track_id","track_album_id","playlist_id") 
spotify.data <- data.frame(spotify[,!(names(spotify) %in% ids)])

Identifying missing data in the dataset and removing the rows

colSums(is.na(spotify.data)) 
##               track_name             track_artist         track_popularity 
##                        5                        5                        0 
##         track_album_name track_album_release_date            playlist_name 
##                        5                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0
spotify.data <- na.omit(spotify.data) 
colSums(is.na(spotify.data)) 
##               track_name             track_artist         track_popularity 
##                        0                        0                        0 
##         track_album_name track_album_release_date            playlist_name 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0

Converting the categorical variables ‘key’ and ‘mode’ to factors

spotify.data$key <- as.factor(spotify.data$key)
spotify.data$mode <- as.factor(spotify.data$mode)

(3.4) Final data after cleaning

knitr::kable(head(spotify.data), align = "lccrr") 

| track_name | track_artist | track_popularity | track_album_name | track_album_release_date | playlist_name | playlist_genre | playlist_subgenre | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms |
| I Don’t Care (with Justin Bieber) - Loud Luxury Remix | Ed Sheeran | 66 | I Don’t Care (with Justin Bieber) [Loud Luxury Remix] | 2019-06-14 | Pop Remix | pop | dance pop | 0.748 | 0.916 | 6 | -2.634 | 1 | 0.0583 | 0.1020 | 0.00e+00 | 0.0653 | 0.518 | 122.036 | 194754 |
| Memories - Dillon Francis Remix | Maroon 5 | 67 | Memories (Dillon Francis Remix) | 2019-12-13 | Pop Remix | pop | dance pop | 0.726 | 0.815 | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 4.21e-03 | 0.3570 | 0.693 | 99.972 | 162600 |
| All the Time - Don Diablo Remix | Zara Larsson | 70 | All the Time (Don Diablo Remix) | 2019-07-05 | Pop Remix | pop | dance pop | 0.675 | 0.931 | 1 | -3.432 | 0 | 0.0742 | 0.0794 | 2.33e-05 | 0.1100 | 0.613 | 124.008 | 176616 |
| Call You Mine - Keanu Silva Remix | The Chainsmokers | 60 | Call You Mine - The Remixes | 2019-07-19 | Pop Remix | pop | dance pop | 0.718 | 0.930 | 7 | -3.778 | 1 | 0.1020 | 0.0287 | 9.40e-06 | 0.2040 | 0.277 | 121.956 | 169093 |
| Someone You Loved - Future Humans Remix | Lewis Capaldi | 69 | Someone You Loved (Future Humans Remix) | 2019-03-05 | Pop Remix | pop | dance pop | 0.650 | 0.833 | 1 | -4.672 | 1 | 0.0359 | 0.0803 | 0.00e+00 | 0.0833 | 0.725 | 123.976 | 189052 |
| Beautiful People (feat. Khalid) - Jack Wins Remix | Ed Sheeran | 67 | Beautiful People (feat. Khalid) [Jack Wins Remix] | 2019-07-11 | Pop Remix | pop | dance pop | 0.675 | 0.919 | 8 | -5.385 | 1 | 0.1270 | 0.0799 | 0.00e+00 | 0.1430 | 0.585 | 124.982 | 163049 |

Checking for outliers and distribution for each numerical column:

Boxplots

num_cols<-c('danceability','energy','loudness','speechiness','acousticness','instrumentalness',
            'liveness','valence','tempo')

par(mfrow=c(3,3))
for (i in num_cols){
  # Title each panel as a boxplot (the original title said 'Histogram')
  boxplot(spotify.data[[i]], main = sprintf('Boxplot of %s', i))
}

Histograms

par(mfrow=c(3,3))
for (i in num_cols){
  hist(x = spotify.data[[i]], 
       col="blue",
       lty=1,
       freq = FALSE,
       main = sprintf('Histogram of %s', i),
       xlab = i)
  lines(density(x = spotify.data[[i]],na.rm=TRUE), lwd=2, col='red')
}

Outliers: The box plots show a few outliers, but in the histograms the majority of these metrics appear approximately normally distributed. Manipulating these outliers may therefore not add much value to the analysis.
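
One way to put a number on "a few outliers" is to count, for each column, the points that a box plot would draw beyond its whiskers. The sketch below uses boxplot.stats(), the function that boxplot() relies on for its statistics, with the default coefficient of 1.5. The helper name count_outliers and the toy data are hypothetical; on the real data the call would be count_outliers(spotify.data, num_cols).

```r
# Hypothetical helper: number of points beyond the boxplot whiskers in each column.
count_outliers <- function(df, cols) {
  vapply(cols,
         function(col) length(boxplot.stats(df[[col]])$out),
         integer(1))
}

# Toy data for illustration only (not taken from the Spotify dataset)
toy <- data.frame(
  tempo   = c(100, 102, 98, 101, 99, 103, 97, 250),  # 250 lies far from the rest
  valence = c(0.40, 0.50, 0.60, 0.45, 0.55, 0.50, 0.50, 0.52)
)
count_outliers(toy, c("tempo", "valence"))
```

Dividing each count by nrow(df) gives the share of observations flagged, which is easier to compare across columns than the raw count.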

(3.5) Summary of Variables

d1 <- read.csv("summary.csv")
DT::datatable(d1, options = list(pageLength = 50, scrollX = TRUE, autoWidth = TRUE), class = 'cell-border stripe')

4. Proposed EDA

(4.1) Uncovering new information

  • We will be doing a correlation analysis to check if there is any correlation between the variables themselves

  • We can create a few new columns from the existing columns to check whether they explain the data better. For example, we can check whether ‘remix’ songs are more popular than other songs by creating a new remix column with values 0 or 1

  • We can summarize the data by picking the most important qualitative variables, such as genre, and some numerical variables, such as energy, beat and danceability, to see how they affect popularity
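
The remix flag and the genre summary described above could look roughly like this. It is a sketch under assumptions: the dataset has no remix column, so a remix is identified here by the word "remix" appearing in track_name, and the column name is_remix and the toy rows are hypothetical.

```r
library(dplyr)

# Toy rows for illustration only; column names match the cleaned dataset
toy <- data.frame(
  track_name       = c("Song A - Club Remix", "Song B", "Song C (REMIX)", "Song D"),
  playlist_genre   = c("pop", "pop", "rock", "rock"),
  track_popularity = c(60, 40, 50, 30)
)

# Flag tracks whose name mentions "remix" (case-insensitive): 1 = remix, 0 = other
toy <- toy %>%
  mutate(is_remix = as.integer(grepl("remix", track_name, ignore.case = TRUE)))

# Mean popularity for remixes vs other tracks
remix_summary <- toy %>%
  group_by(is_remix) %>%
  summarise(mean_popularity = mean(track_popularity), n = n(), .groups = "drop")

# Mean popularity by genre, highest first
genre_summary <- toy %>%
  group_by(playlist_genre) %>%
  summarise(mean_popularity = mean(track_popularity), .groups = "drop") %>%
  arrange(desc(mean_popularity))

remix_summary
genre_summary
```

Matching on the track name is a heuristic: it would miss remixes that are not labelled as such and would flag any track that merely has the word in its title.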

(4.2) Plots and Tables

  • We plan to create bivariate plots of popularity against other variables to help us gauge the relationships between them
  • Slicing the data by qualitative factors and plotting the results on a bar chart can show whether popularity differs noticeably between the levels of these factors.
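
Both kinds of plot can be sketched with ggplot2. The example below is illustrative only: the toy data frame is made up, and danceability and playlist_genre are simply one possible choice of numeric feature and qualitative factor.

```r
library(ggplot2)

# Toy data for illustration only (not taken from the Spotify dataset)
toy <- data.frame(
  playlist_genre   = c("pop", "pop", "rock", "rock", "rap", "rap"),
  danceability     = c(0.75, 0.68, 0.45, 0.52, 0.80, 0.85),
  track_popularity = c(66, 60, 40, 45, 55, 58)
)

# Bivariate plot: one numeric feature against popularity, with a linear trend line
p_scatter <- ggplot(toy, aes(x = danceability, y = track_popularity)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Danceability", y = "Track popularity")

# Bar chart: mean popularity for each level of a qualitative factor
genre_means <- aggregate(track_popularity ~ playlist_genre, data = toy, FUN = mean)
p_bar <- ggplot(genre_means,
                aes(x = reorder(playlist_genre, -track_popularity),
                    y = track_popularity)) +
  geom_col() +
  labs(x = "Playlist genre", y = "Mean track popularity")

print(p_scatter)
print(p_bar)
```

A bar chart of means shows the size of the differences between levels but not whether they are statistically significant; a formal test would be needed for that.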

(4.3) Need to Learn

  • We plan to learn techniques such as feature engineering for our exploratory analysis. We are also looking at ways to classify songs as popular or non-popular and to apply various ML techniques in order to gain better insights into the data.

(4.4) ML Techniques

  • We believe that regression analysis is appropriate for estimating or predicting the popularity of a song, since it is a numerical variable. We can also use ML classification techniques such as logistic regression and decision trees by converting the target variable track_popularity into two classes, ‘popular’ and ‘not popular’, based on a cutoff value.
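
Both modelling options can be sketched with base R. The data below is simulated purely for illustration, the choice of predictors is arbitrary, and the median is used as the cutoff only as an assumption; the real cutoff for ‘popular’ is a modelling decision still to be made.

```r
set.seed(1)  # reproducible simulated data

# Simulated data for illustration only (not Spotify data)
n   <- 200
toy <- data.frame(danceability = runif(n), energy = runif(n))
toy$track_popularity <- 30 + 40 * toy$danceability + rnorm(n, sd = 10)

# Linear regression on the numeric target
fit_lm <- lm(track_popularity ~ danceability + energy, data = toy)

# Binary target from an assumed cutoff (here the median popularity)
cutoff      <- median(toy$track_popularity)
toy$popular <- as.integer(toy$track_popularity > cutoff)

# Logistic regression on the binary target
fit_glm <- glm(popular ~ danceability + energy, data = toy, family = binomial)

coef(fit_lm)
coef(fit_glm)
```

Using the median gives two classes of equal size, which avoids class imbalance; a higher cutoff would define ‘popular’ more strictly but leave fewer positive examples to learn from.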