Final Project Evaluation

1. Introduction

(1.1) Introduction

Spotify is an audio streaming provider that offers recorded music and podcasts, with a catalog of more than 70 million songs. This project analyzes data about songs streaming on Spotify to uncover interesting trends and the factors that contribute most to a song’s popularity.

(1.2) Problem Statement

To understand the dataset and what it contains, and to determine how it can be used for our analysis
To perform Data Cleaning if necessary so that the data is usable for our analysis
To perform Exploratory Data Analysis and understand the variables contributing to a song’s popularity

(1.3) Approach

We will look at the properties of each variable in the dataset and clean the data where there is too much missing data or too many outliers
We will then plot the correlation between a song’s popularity and the other variables to see which variables correlate well with it

(1.4) Consumer Impact

Determining the variables that most affect a song’s popularity can help creators compose new songs that reach a wider audience, much as Netflix used viewer data to create better web series
It can help companies decide which types of songs or artists to invest in so that they can maximize revenue

2. Packages Required

(2.1 - 2.3) Packages Used

library(tidyr)    # tidyr is used to tidy data, i.e. to reshape it into a form that is easy to work with
library(dplyr)    # dplyr makes tabular data wrangling easier through a small set of functions that can be combined to extract and summarize insights
library(ggplot2)  # ggplot2 is used to create visualisations
library(DT)       # DT is used to display interactive tables in HTML
library(tidyverse) # Loads the core tidyverse packages, including forcats (fct_reorder) and readr, which are not loaded above
library(corrplot) # corrplot is used to plot the correlation between variables
library(RColorBrewer) # RColorBrewer provides the color palette for the correlation plot

3. Data Preparation

(3.1) Data Source

Source: https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-21/readme.md

Importing the data into R:

spotify <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
knitr::kable(head(spotify), align = "lccrr")

track_id	track_name	track_artist	track_popularity	track_album_id	track_album_name	track_album_release_date	playlist_name	playlist_id	playlist_genre	playlist_subgenre	danceability	energy	key	loudness	mode	speechiness	acousticness	instrumentalness	liveness	valence	tempo	duration_ms
6f807x0ima9a1j3VPbc7VN	I Don’t Care (with Justin Bieber) - Loud Luxury Remix	Ed Sheeran	66	2oCs0DGTsRO98Gh5ZSl2Cx	I Don’t Care (with Justin Bieber) [Loud Luxury Remix]	2019-06-14	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	dance pop	0.748	0.916	6	-2.634	1	0.0583	0.1020	0.00e+00	0.0653	0.518	122.036	194754
0r7CVbZTWZgbTCYdfa2P31	Memories - Dillon Francis Remix	Maroon 5	67	63rPSO264uRjW1X5E6cWv6	Memories (Dillon Francis Remix)	2019-12-13	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	dance pop	0.726	0.815	11	-4.969	1	0.0373	0.0724	4.21e-03	0.3570	0.693	99.972	162600
1z1Hg7Vb0AhHDiEmnDE79l	All the Time - Don Diablo Remix	Zara Larsson	70	1HoSmj2eLcsrR0vE9gThr4	All the Time (Don Diablo Remix)	2019-07-05	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	dance pop	0.675	0.931	1	-3.432	0	0.0742	0.0794	2.33e-05	0.1100	0.613	124.008	176616
75FpbthrwQmzHlBJLuGdC7	Call You Mine - Keanu Silva Remix	The Chainsmokers	60	1nqYsOef1yKKuGOVchbsk6	Call You Mine - The Remixes	2019-07-19	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	dance pop	0.718	0.930	7	-3.778	1	0.1020	0.0287	9.40e-06	0.2040	0.277	121.956	169093
1e8PAfcKUYoKkxPhrHqw4x	Someone You Loved - Future Humans Remix	Lewis Capaldi	69	7m7vv9wlQ4i0LFuJiE2zsQ	Someone You Loved (Future Humans Remix)	2019-03-05	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	dance pop	0.650	0.833	1	-4.672	1	0.0359	0.0803	0.00e+00	0.0833	0.725	123.976	189052
7fvUMiyapMsRRxr07cU8Ef	Beautiful People (feat. Khalid) - Jack Wins Remix	Ed Sheeran	67	2yiy9cd2QktrNvWC2EUi0k	Beautiful People (feat. Khalid) [Jack Wins Remix]	2019-07-11	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	dance pop	0.675	0.919	8	-5.385	1	0.1270	0.0799	0.00e+00	0.1430	0.585	124.982	163049

(3.2) About the Dataset

The dataset contains 32,833 rows and 23 columns and was collected in January 2020. Our target variable is track_popularity. The remaining columns describe each track: song credits (such as artist and album) and audio features such as danceability, energy and acousticness. There are 15 missing values in the data, 5 each in track_name, track_artist and track_album_name; the affected rows are removed during data cleaning.

d <- read.csv("spotify_dd.csv")
DT::datatable(d, options = list(pageLength = 50, scrollX = TRUE, autoWidth = TRUE), class = 'cell-border stripe')

Data Cleaning

(3.3) Dropping ID columns that are not needed for the analysis

ids <- c("track_id","track_album_id","playlist_id") 
spotify.data <- data.frame(spotify[,!(names(spotify) %in% ids)])
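
The same column drop can also be written with dplyr, which is already loaded above. This is a minimal sketch on a hypothetical toy data frame (`toy`, with made-up values), not a replacement for the code above; note that the original additionally wraps the result in data.frame().

```r
library(dplyr)

# Toy stand-in for `spotify` (hypothetical values, not the real data)
toy <- data.frame(
  track_id         = c("id1", "id2"),
  track_name       = c("Song A", "Song B"),
  track_album_id   = c("alb1", "alb2"),
  playlist_id      = c("pl1", "pl1"),
  track_popularity = c(66, 67),
  stringsAsFactors = FALSE
)

ids <- c("track_id", "track_album_id", "playlist_id")

# all_of() raises an error if a name in `ids` is missing, which catches typos early
toy_data <- toy %>% select(-all_of(ids))
names(toy_data)
```

The base R version used in this report silently ignores names that do not exist, whereas all_of() fails loudly, which may be preferable when the column list is typed by hand.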

Identifying missing data in the dataset and removing the rows

colSums(is.na(spotify.data))

##               track_name             track_artist         track_popularity 
##                        5                        5                        0 
##         track_album_name track_album_release_date            playlist_name 
##                        5                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0

spotify.data <- na.omit(spotify.data) 
colSums(is.na(spotify.data))

##               track_name             track_artist         track_popularity 
##                        0                        0                        0 
##         track_album_name track_album_release_date            playlist_name 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0
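
Because na.omit() removes whole rows, the number of missing cells and the number of dropped rows can differ. The sketch below illustrates the difference on a hypothetical toy data frame (made-up values) in which the missing cells cluster in the same rows.

```r
# Toy data: the NA cells fall in the same two rows (hypothetical values)
toy <- data.frame(
  track_name       = c("Song A", NA, "Song C", NA),
  track_artist     = c("Artist 1", NA, "Artist 2", NA),
  track_album_name = c("Album X", NA, "Album Y", NA),
  track_popularity = c(66, 0, 70, 3),
  stringsAsFactors = FALSE
)

na_cells <- sum(is.na(toy))              # number of missing cells
na_rows  <- sum(!complete.cases(toy))    # number of rows with at least one NA

toy_clean    <- na.omit(toy)             # drops every row that contains an NA
rows_dropped <- nrow(toy) - nrow(toy_clean)

c(na_cells = na_cells, na_rows = na_rows, rows_dropped = rows_dropped)
```

Comparing nrow() before and after cleaning is a quick way to confirm how many observations were actually lost.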

Converting the categorical variables ‘key’ and ‘mode’ to factors

spotify.data$key <- as.factor(spotify.data$key)
spotify.data$mode <- as.factor(spotify.data$mode)
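
Readable factor labels can make later plots easier to interpret. The sketch below assumes Spotify's documented encoding (key in pitch-class notation with 0 = C, and mode with 1 = major and 0 = minor); the vectors `key_raw` and `mode_raw` are hypothetical examples, not taken from the dataset.

```r
# Hypothetical key/mode values in Spotify's integer encoding
key_raw  <- c(6, 11, 1, 7, 0)
mode_raw <- c(1, 1, 0, 1, 0)

# Assumption: pitch-class notation, 0 = C, 1 = C#/Db, ..., 11 = B
pitch_classes <- c("C", "C#/Db", "D", "D#/Eb", "E", "F",
                   "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B")

key_f  <- factor(key_raw,  levels = 0:11,    labels = pitch_classes)
mode_f <- factor(mode_raw, levels = c(0, 1), labels = c("minor", "major"))

as.character(key_f)
as.character(mode_f)
```

Fixing the levels explicitly keeps all 12 keys in the factor even if some never occur in the data.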

(3.4) Final data after cleaning

knitr::kable(head(spotify.data,2), align = "lccrr")

track_name	track_artist	track_popularity	track_album_name	track_album_release_date	playlist_name	playlist_genre	playlist_subgenre	danceability	energy	key	loudness	mode	speechiness	acousticness	instrumentalness	liveness	valence	tempo	duration_ms
I Don’t Care (with Justin Bieber) - Loud Luxury Remix	Ed Sheeran	66	I Don’t Care (with Justin Bieber) [Loud Luxury Remix]	2019-06-14	Pop Remix	pop	dance pop	0.748	0.916	6	-2.634	1	0.0583	0.1020	0.00000	0.0653	0.518	122.036	194754
Memories - Dillon Francis Remix	Maroon 5	67	Memories (Dillon Francis Remix)	2019-12-13	Pop Remix	pop	dance pop	0.726	0.815	11	-4.969	1	0.0373	0.0724	0.00421	0.3570	0.693	99.972	162600

Checking for outliers and distribution for each numerical column:

Boxplots

num_cols <- c('danceability', 'energy', 'loudness', 'speechiness', 'acousticness', 'instrumentalness',
              'liveness', 'valence', 'track_popularity')

par(mfrow = c(3, 3))  # 3 x 3 grid, one panel per variable
for (i in num_cols) {
  boxplot(spotify.data[[i]], main = sprintf('Boxplot of %s', i))
}

Outliers: The box plots show outliers for several metrics. Box plots alone cannot establish normality, and the histograms in Section 4 show that most of these metrics are not normally distributed. Manipulating these outliers may not add much value to the analysis, so they are left unchanged.
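
To put a number on "a few outliers", the points that a box plot draws beyond its whiskers can be counted with boxplot.stats(), which applies the same 1.5 x IQR rule as boxplot(). This is a sketch on hypothetical toy values, not on the real columns.

```r
# Toy columns (hypothetical values): one with an extreme value, one without
toy <- data.frame(
  speechiness = c(0.10, 0.12, 0.11, 0.13, 0.12, 0.14, 0.11, 0.95),
  valence     = c(0.30, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70, 0.80)
)

# boxplot.stats()$out holds the points lying beyond the whiskers
outlier_counts <- sapply(toy, function(col) length(boxplot.stats(col)$out))
outlier_counts
```

Applied to the columns listed in num_cols, the same sapply() call would give one outlier count per variable.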

(3.5) Summary of Variables

d1 <- read.csv("summary.csv")
DT::datatable(d1, options = list(pageLength = 50, scrollX = TRUE, autoWidth = TRUE), class = 'cell-border stripe')

4. Exploratory Data Analysis

The graphs below show the distribution of each numerical variable; most of them are not normally distributed. The primary motivation is to understand how popularity and the other numerical variables are distributed.

par(mfrow=c(3,3))
for (i in num_cols){
  hist(x = spotify.data[[i]], 
       col="blue",
       lty=1,
       freq = FALSE,
       main = sprintf('Histogram of %s', i),
       xlab = i)
  lines(density(x = spotify.data[[i]],na.rm=TRUE), lwd=2, col='red')
}
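
A simple numeric complement to the histograms is the sample skewness, which is close to 0 for a symmetric distribution and positive for a long right tail. The helper `skewness()` below is a hypothetical function written for this sketch (base R has no built-in skewness), and the input vectors are made-up values.

```r
# Moment-based sample skewness: third central moment over the cubed standard deviation
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

symmetric    <- c(1, 2, 3, 4, 5)     # hypothetical symmetric values
right_skewed <- c(1, 1, 1, 2, 10)    # hypothetical values with a long right tail

skewness(symmetric)
skewness(right_skewed)
```

Running sapply(spotify.data[num_cols], skewness) would give one value per histogram, which makes it easier to say which variables depart most from symmetry.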

The motivation behind isolating the numerical variables is to check for strong correlations between them and to see which variables are needed to gauge the popularity of a song

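# The original `spotify` tibble is used here because `key` and `mode` are still numeric in it
# (they were converted to factors in spotify.data, which cor() cannot handle)
spotify_numeric <- c("track_popularity", "danceability", "energy", "key", "loudness", "mode", "speechiness", "acousticness", "instrumentalness", "liveness", "valence", "tempo", "duration_ms")
M <- cor(spotify[, spotify_numeric])
corrplot(M, type = "lower", order = "hclust", method = "number",
         col = brewer.pal(n = 8, name = "RdYlBu"))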
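
Since the question is which variables relate to popularity, it can help to pull the popularity row out of the correlation matrix and rank it by absolute value. The sketch below uses simulated toy data (the coefficients and column values are invented for illustration), so its numbers say nothing about the real dataset.

```r
set.seed(1)
n <- 200

# Simulated toy features (hypothetical, not the Spotify data)
toy <- data.frame(
  danceability = runif(n),
  energy       = runif(n),
  duration_ms  = runif(n, 1e5, 3e5)
)
toy$track_popularity <- 50 + 20 * toy$danceability - 30 * toy$energy + rnorm(n, sd = 5)

M_toy <- cor(toy)

# Correlation of every other variable with popularity, strongest first
pop_cor <- M_toy["track_popularity", colnames(M_toy) != "track_popularity"]
ranked  <- pop_cor[order(abs(pop_cor), decreasing = TRUE)]
round(ranked, 2)
```

The sign of each entry matters for interpretation: a negative value means popularity tends to fall as that variable increases.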

The bar chart shows which artists have the highest number of songs in the dataset. A large number of songs could indicate that an artist is more popular than others, and it could also increase the artist’s chances of having a successful song in their discography

popular_artist <- spotify.data %>%
  group_by(track_artist) %>%
  summarize(no_of_tracks = n()) %>%
  top_n(15, no_of_tracks) %>%   # rank explicitly by track count
  arrange(desc(no_of_tracks))

ggplot(popular_artist, aes(x = fct_reorder(track_artist, no_of_tracks), y = no_of_tracks)) +
  geom_bar(stat = 'identity') +
  labs(y = "Number of songs", x = "Artist Name") +
  ggtitle("Artists with the most tracks in the dataset") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
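
Whether a large number of tracks goes together with high popularity can be checked by putting the track count next to the mean popularity per artist. This is a base R sketch on a hypothetical toy data frame with made-up artists and values.

```r
# Toy data (hypothetical artists and popularity values)
toy <- data.frame(
  track_artist     = c("A", "A", "A", "B", "B", "C"),
  track_popularity = c(50, 60, 70, 90, 80, 10),
  stringsAsFactors = FALSE
)

counts <- table(toy$track_artist)
means  <- tapply(toy$track_popularity, toy$track_artist, mean)

# table() and tapply() both order artists alphabetically, so the columns line up
artist_summary <- data.frame(
  no_of_tracks    = as.vector(counts),
  mean_popularity = as.vector(means),
  row.names       = names(counts)
)

artist_summary[order(artist_summary$no_of_tracks, decreasing = TRUE), ]
```

In this toy example the artist with the most tracks does not have the highest mean popularity, which is the kind of pattern worth checking for in the real data.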

The box plot below shows how track popularity varies with playlist genre

boxplot(track_popularity~playlist_genre, data = spotify.data, main = "Popularity vs Genre")
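
The genre comparison in the box plot can be backed by a short table of median popularity per genre. The values below are hypothetical toy data, used only to show the computation.

```r
# Toy data (hypothetical popularity values per genre)
toy <- data.frame(
  playlist_genre   = c("pop", "pop", "pop", "rock", "rock", "edm", "edm", "edm"),
  track_popularity = c(70, 60, 80, 40, 50, 30, 35, 20),
  stringsAsFactors = FALSE
)

# Median popularity for each genre, highest first
genre_median <- sapply(split(toy$track_popularity, toy$playlist_genre), median)
genre_median <- sort(genre_median, decreasing = TRUE)
genre_median
```

The median is used here because it is the line drawn inside each box, so the table matches what the plot shows.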

Popularity doesn’t seem to vary drastically with the ‘Key’ of a track

boxplot(track_popularity~key, data = spotify.data, main = "Popularity vs Key")
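
The visual impression that popularity does not depend on key could be checked with a rank-based test such as Kruskal-Wallis, which does not assume normality. This is a sketch on simulated toy data in which popularity is generated independently of key; the test was not part of the original analysis.

```r
set.seed(42)

# Simulated toy data: popularity drawn independently of key (hypothetical)
toy <- data.frame(
  key              = factor(rep(0:11, each = 30)),
  track_popularity = round(runif(360, min = 0, max = 100))
)

# Null hypothesis: the popularity distribution is the same for every key
kw <- kruskal.test(track_popularity ~ key, data = toy)
kw$statistic
kw$p.value
```

With more than 30,000 tracks even tiny differences can yield a small p-value, so the size of the differences in the box plot should be read alongside the test.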

Model:

model_lm <- lm(track_popularity~danceability+ energy + loudness + speechiness + acousticness + liveness + valence + duration_ms, data = spotify.data)

summary(model_lm)

## 
## Call:
## lm(formula = track_popularity ~ danceability + energy + loudness + 
##     speechiness + acousticness + liveness + valence + duration_ms, 
##     data = spotify.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -58.432 -17.749   2.716  18.938  71.620 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.621e+01  1.479e+00  58.296  < 2e-16 ***
## danceability  1.692e+00  1.026e+00   1.648  0.09932 .  
## energy       -3.431e+01  1.172e+00 -29.275  < 2e-16 ***
## loudness      1.835e+00  6.236e-02  29.427  < 2e-16 ***
## speechiness  -4.334e+00  1.350e+00  -3.210  0.00133 ** 
## acousticness  2.360e+00  7.337e-01   3.217  0.00130 ** 
## liveness     -4.203e+00  8.864e-01  -4.742 2.13e-06 ***
## valence       5.697e+00  6.270e-01   9.085  < 2e-16 ***
## duration_ms  -4.688e-05  2.288e-06 -20.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.21 on 32819 degrees of freedom
## Multiple R-squared:  0.06135,    Adjusted R-squared:  0.06112 
## F-statistic: 268.1 on 8 and 32819 DF,  p-value: < 2.2e-16

As expected from the earlier correlation plot, there is no strong linear relationship between the variables and popularity. Most coefficients are statistically significant, which is unsurprising with roughly 32,800 observations, but the model explains only about 6% of the variance in popularity (R-squared = 0.061) and the residual standard error is 24.21. The exception is danceability: its p-value of 0.099 means its coefficient cannot be distinguished from 0 at the 5% level.
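
Statistical significance and explanatory power are separate quantities, and both can be read from the summary object rather than from the printed output. The sketch below fits a model to simulated toy data (invented coefficients and noise level), so the numbers are illustrative only.

```r
set.seed(7)
n <- 500

# Simulated toy data with weak effects and a lot of noise (hypothetical)
toy <- data.frame(energy = runif(n), valence = runif(n))
toy$track_popularity <- 40 - 5 * toy$energy + 3 * toy$valence + rnorm(n, sd = 20)

fit <- lm(track_popularity ~ energy + valence, data = toy)
s   <- summary(fit)

r2         <- s$r.squared               # share of variance explained
coef_table <- coef(s)                   # Estimate, Std. Error, t value, Pr(>|t|)
p_values   <- coef_table[, "Pr(>|t|)"]

r2
names(p_values)[p_values < 0.05]        # terms significant at the 5% level
```

For the model in this report, the same calls on summary(model_lm) would return the R-squared and p-values quoted above.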

6. Summary

(6.1) Problems that have been addressed:

Cleaned the data provided by Spotify to analyze the songs.
Performed Exploratory Data Analysis to identify top trends for the songs in the dataset.
Checked how correlated a song’s popularity is with its numerical features such as danceability, acousticness etc.

(6.2) Methodology followed to address the problems:

Checked for outliers, dropped the variables that are not relevant to the problem, checked for NAs and removed the rows containing them to eliminate unwanted noise from the data
Plotted box plots and histograms to identify the distribution of each variable. Checked the number of songs by artist
Plotted a correlation matrix of the numerical variables to discover pairs of variables that are highly correlated

(6.3) Insights:

The metrics Danceability, Energy and Valence are fairly normally distributed, unlike the other numerical variables
The key insight from this analysis is that, among all the variables, track_popularity has relatively higher correlation with duration, instrumentalness and energy. The regression coefficients for duration_ms and energy are negative, which suggests that shorter, less energetic songs with more vocals tend to be more popular

However, these correlations are weak, which makes it difficult to predict the popularity of new songs.

The following findings complement the key insight above:

- The artist Martin Garrix has the highest number of tracks

- Pop songs seem to have relatively higher popularity than tracks from other genres

- The popularity of a song does not vary much with the ‘Key’ of the track, which is counterintuitive considering the importance given to the ‘Key’ while composing a track

- The linear regression model explains only about 6% of the variance in popularity, so a more flexible model that can handle nonlinear relationships is needed
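
One inexpensive way to allow for nonlinear relationships, before moving to a more complex model, is to add polynomial terms to the linear model. The sketch below uses simulated data with a deliberately curved relationship (the variable `x` and its effect are invented), so it only illustrates the technique and makes no claim about the Spotify data.

```r
set.seed(123)
n <- 300

# Simulated toy data: popularity peaks at mid-range values of a hypothetical feature x
x <- runif(n)
y <- 60 - 80 * (x - 0.5)^2 + rnorm(n, sd = 3)

fit_linear <- lm(y ~ x)             # straight line only
fit_quad   <- lm(y ~ poly(x, 2))    # adds a quadratic term

adj_linear <- summary(fit_linear)$adj.r.squared
adj_quad   <- summary(fit_quad)$adj.r.squared

c(linear = adj_linear, quadratic = adj_quad)
anova(fit_linear, fit_quad)         # F-test: does the extra term improve the fit?
```

If polynomial terms do not help on the real data, that would point toward interactions or categorical predictors such as genre rather than curvature in single features.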

(6.4) Implications to the customer:

The analysis suggests that composers and production houses should pay attention to properties like instrumentalness and energy while composing a track, as these show the strongest (though still weak) association with popularity
Also, the ‘Key’ of a track may not be an important factor while composing it

(6.5) Limitations:

Although a few key metrics associated with the popularity of a track have been identified, the model explains very little of the variation in popularity. So, the predicted popularities may not be very accurate
This dataset comes from a single source, Spotify. Adding data from other sources could help us build better models and predict popularity more accurately