ANALYZING SONGS FROM SPOTIFY
Spotify is an audio streaming provider that offers recorded music and podcasts, with a catalog of more than 70 million songs. This project analyzes data about songs streaming on Spotify to uncover interesting trends and the factors most strongly associated with a song’s popularity.
library(tidyr) # tidyr reshapes data into a tidy, easy-to-work-with format
library(dplyr) # dplyr provides a small set of composable verbs for wrangling and summarizing tabular data
library(ggplot2) # ggplot2 is used to create the visualisations
library(DT) # DT displays interactive tables in HTML
library(tidyverse) # Loads the remaining tidyverse packages, e.g. readr (read_csv) and forcats (fct_reorder)
library(corrplot) # corrplot plots the correlation matrix
library(RColorBrewer) # RColorBrewer supplies the colour palette for the correlation plot
Source: https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-21/readme.md
Importing the data into R:
spotify <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
knitr::kable(head(spotify), align = "lccrr")
| track_id | track_name | track_artist | track_popularity | track_album_id | track_album_name | track_album_release_date | playlist_name | playlist_id | playlist_genre | playlist_subgenre | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6f807x0ima9a1j3VPbc7VN | I Don’t Care (with Justin Bieber) - Loud Luxury Remix | Ed Sheeran | 66 | 2oCs0DGTsRO98Gh5ZSl2Cx | I Don’t Care (with Justin Bieber) [Loud Luxury Remix] | 2019-06-14 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.748 | 0.916 | 6 | -2.634 | 1 | 0.0583 | 0.1020 | 0.00e+00 | 0.0653 | 0.518 | 122.036 | 194754 |
| 0r7CVbZTWZgbTCYdfa2P31 | Memories - Dillon Francis Remix | Maroon 5 | 67 | 63rPSO264uRjW1X5E6cWv6 | Memories (Dillon Francis Remix) | 2019-12-13 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.726 | 0.815 | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 4.21e-03 | 0.3570 | 0.693 | 99.972 | 162600 |
| 1z1Hg7Vb0AhHDiEmnDE79l | All the Time - Don Diablo Remix | Zara Larsson | 70 | 1HoSmj2eLcsrR0vE9gThr4 | All the Time (Don Diablo Remix) | 2019-07-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.675 | 0.931 | 1 | -3.432 | 0 | 0.0742 | 0.0794 | 2.33e-05 | 0.1100 | 0.613 | 124.008 | 176616 |
| 75FpbthrwQmzHlBJLuGdC7 | Call You Mine - Keanu Silva Remix | The Chainsmokers | 60 | 1nqYsOef1yKKuGOVchbsk6 | Call You Mine - The Remixes | 2019-07-19 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.718 | 0.930 | 7 | -3.778 | 1 | 0.1020 | 0.0287 | 9.40e-06 | 0.2040 | 0.277 | 121.956 | 169093 |
| 1e8PAfcKUYoKkxPhrHqw4x | Someone You Loved - Future Humans Remix | Lewis Capaldi | 69 | 7m7vv9wlQ4i0LFuJiE2zsQ | Someone You Loved (Future Humans Remix) | 2019-03-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.650 | 0.833 | 1 | -4.672 | 1 | 0.0359 | 0.0803 | 0.00e+00 | 0.0833 | 0.725 | 123.976 | 189052 |
| 7fvUMiyapMsRRxr07cU8Ef | Beautiful People (feat. Khalid) - Jack Wins Remix | Ed Sheeran | 67 | 2yiy9cd2QktrNvWC2EUi0k | Beautiful People (feat. Khalid) [Jack Wins Remix] | 2019-07-11 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.675 | 0.919 | 8 | -5.385 | 1 | 0.1270 | 0.0799 | 0.00e+00 | 0.1430 | 0.585 | 124.982 | 163049 |
The dataset contains 32,833 rows and 23 columns and was collected in January 2020. Our target variable is track_popularity; the remaining columns describe song credits and audio features such as danceability, energy and acousticness. There are 15 missing values in the data (five each in track_name, track_artist and track_album_name); the affected rows are removed during data cleaning.
d <- read.csv("spotify_dd.csv")
DT::datatable(d,options = list( pageLength=50, scrollX = T,autoWidth = TRUE),class = 'cell-border stripe')
Data Cleaning
ids <- c("track_id","track_album_id","playlist_id")
spotify.data <- data.frame(spotify[,!(names(spotify) %in% ids)])
Identifying missing data in the dataset and removing the rows
colSums(is.na(spotify.data))
## track_name track_artist track_popularity
## 5 5 0
## track_album_name track_album_release_date playlist_name
## 5 0 0
## playlist_genre playlist_subgenre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
spotify.data <- na.omit(spotify.data)
colSums(is.na(spotify.data))
## track_name track_artist track_popularity
## 0 0 0
## track_album_name track_album_release_date playlist_name
## 0 0 0
## playlist_genre playlist_subgenre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
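The output above counts missing cells per column, which is not the same as the number of rows that `na.omit()` drops. A minimal sketch on a small made-up data frame (`toy` and its values are hypothetical, not taken from the Spotify data) shows the difference:

```r
# Hypothetical toy data: two rows contain missing values, three cells in total
toy <- data.frame(
  track_name   = c("a", NA, "c", NA),
  track_artist = c("x", NA, "z", "w"),
  popularity   = c(10, 20, 30, 40)
)

missing_cells <- sum(is.na(toy))            # total number of NA cells
missing_rows  <- sum(!complete.cases(toy))  # rows with at least one NA
cleaned       <- na.omit(toy)               # drops every incomplete row

print(c(cells = missing_cells, rows = missing_rows, kept = nrow(cleaned)))
```

The same logic applies to the Spotify data: the linear model fitted later reports 32,819 residual degrees of freedom with 9 estimated coefficients, i.e. 32,828 rows, which is five fewer than the original 32,833 even though 15 cells were missing.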
Converting the categorical variables ‘key’ and ‘mode’ to factors
spotify.data$key <- as.factor(spotify.data$key)
spotify.data$mode <- as.factor(spotify.data$mode)
knitr::kable(head(spotify.data,2), align = "lccrr")
| track_name | track_artist | track_popularity | track_album_name | track_album_release_date | playlist_name | playlist_genre | playlist_subgenre | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| I Don’t Care (with Justin Bieber) - Loud Luxury Remix | Ed Sheeran | 66 | I Don’t Care (with Justin Bieber) [Loud Luxury Remix] | 2019-06-14 | Pop Remix | pop | dance pop | 0.748 | 0.916 | 6 | -2.634 | 1 | 0.0583 | 0.1020 | 0.00000 | 0.0653 | 0.518 | 122.036 | 194754 |
| Memories - Dillon Francis Remix | Maroon 5 | 67 | Memories (Dillon Francis Remix) | 2019-12-13 | Pop Remix | pop | dance pop | 0.726 | 0.815 | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 0.00421 | 0.3570 | 0.693 | 99.972 | 162600 |
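After the conversion, R treats `key` and `mode` as categories, but the levels are still the raw integer codes. Spotify encodes `key` in pitch-class notation (0 = C, 1 = C♯/D♭, and so on) and `mode` as 1 = major, 0 = minor, so readable labels can be attached during the conversion. The sketch below is one possible way to do this; the `toy` data frame and the label spellings in `pitch_classes` are assumptions made for illustration:

```r
# Hypothetical toy rows using the same integer coding as the Spotify columns
toy <- data.frame(key = c(6, 11, 1), mode = c(1, 1, 0))

# Pitch-class labels for codes 0..11 (sharps only, an arbitrary spelling choice)
pitch_classes <- c("C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B")

toy$key  <- factor(toy$key,  levels = 0:11,    labels = pitch_classes)
toy$mode <- factor(toy$mode, levels = c(0, 1), labels = c("minor", "major"))

print(toy)
```

Labelled levels make plots such as a popularity-by-key boxplot easier to read, because the axis shows note names instead of codes.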
Checking for outliers and distribution for each numerical column:
Boxplots
num_cols <- c('danceability','energy','loudness','speechiness','acousticness','instrumentalness',
              'liveness','valence','track_popularity')
par(mfrow=c(3,3)) # 3 x 3 grid, one panel per variable
for (i in num_cols){
  boxplot(spotify.data[[i]], main=sprintf('Boxplot of %s',i))
}
Outliers: The box plots show a few outliers. A box plot alone cannot establish whether a metric is normally distributed; the histograms that follow give a clearer view of each distribution. Manipulating these outliers may not add much value to the analysis.
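To put a number on "a few outliers", the points a box plot draws individually can be extracted with `boxplot.stats()`, which applies the same 1.5 × IQR rule as `boxplot()`. The vector `x` below is hypothetical and only illustrates the call:

```r
# Hypothetical values; 0.95 lies far from the rest
x <- c(0.04, 0.05, 0.06, 0.07, 0.08, 0.95)

stats    <- boxplot.stats(x)             # points beyond 1.5 * IQR from the hinges
outliers <- stats$out                    # the values flagged as outliers
share    <- length(outliers) / length(x) # fraction of observations flagged

print(outliers)
print(share)
```

Applied to each column in `num_cols`, the share of flagged values would show whether the outliers are rare enough to leave untouched.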
d1 <- read.csv("summary.csv")
DT::datatable(d1,options = list(pageLength=50, scrollX = T,autoWidth = TRUE),class = 'cell-border stripe')
The histograms below show the distribution of each variable, several of which are clearly non-normal. The primary motivation is to understand how popularity is distributed alongside the other numerical variables.
par(mfrow=c(3,3))
for (i in num_cols){
  hist(x = spotify.data[[i]],
       col="blue",
       lty=1,
       freq = FALSE,
       main = sprintf('Histogram of %s',i),
       xlab = i)
  lines(density(x = spotify.data[[i]],na.rm=TRUE), lwd=2, col='red')
}
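Judging normality from histograms is subjective, so a simple numeric summary can help. The sketch below defines a moment-based sample skewness (the helper `skewness` is written here for illustration and does not come from any package) and applies it to simulated data rather than to the Spotify columns:

```r
# Moment-based sample skewness: about 0 for symmetric data, > 0 for a long right tail
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(42)                 # simulated data, not the Spotify columns
symmetric    <- rnorm(5000)  # bell-shaped
right_skewed <- rexp(5000)   # long right tail

print(c(symmetric = skewness(symmetric), right_skewed = skewness(right_skewed)))
```

Calling `sapply(spotify.data[num_cols], skewness)` would give one value per metric, making it easier to say which variables are roughly symmetric and which are not.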
The motivation behind isolating the numerical variables is to check for strong correlations among them and to see which variables are useful for gauging a song’s popularity.
spotify_numeric <- c("track_popularity" ,"danceability","energy","key","loudness","mode","speechiness","acousticness","instrumentalness","liveness","valence","tempo","duration_ms")
# Use the raw `spotify` tibble here: key and mode are still numeric in it, whereas spotify.data stores them as factors
M <- cor(spotify[,spotify_numeric])
corrplot(M, type="lower", order="hclust", method = "number",
         col=brewer.pal(n=8, name="RdYlBu"))
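Since the question is which variables relate to popularity, it can be easier to read one column of the correlation matrix, sorted by absolute value, than the full plot. The sketch below does this on simulated data; `toy`, `a`, `b` and `target` are hypothetical names. Note also that `key` and `mode` are categorical codes, so a Pearson coefficient for them is hard to interpret.

```r
set.seed(1)
n <- 200
toy <- data.frame(a = rnorm(n), b = rnorm(n))
toy$target <- 2 * toy$a - 0.5 * toy$b + rnorm(n)  # a matters much more than b

M_toy <- cor(toy)

# Keep the correlations with the target, dropping its correlation with itself
target_cor <- M_toy[rownames(M_toy) != "target", "target"]

# Rank by strength, regardless of sign
ranked <- target_cor[order(abs(target_cor), decreasing = TRUE)]
print(round(ranked, 2))
```

With the matrix `M` computed above, the equivalent column would be `M[, "track_popularity"]`.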
The bar chart shows which artists have the highest number of songs in the dataset. A large number of songs could indicate that an artist is more popular than others, and it could also increase an artist’s chance of having a successful song in their discography.
popular_artist <- spotify.data %>%
  group_by(track_artist) %>%
  summarize(no_of_tracks = n()) %>%
  top_n(15, no_of_tracks) %>% # rank explicitly by track count
  arrange(desc(no_of_tracks))
ggplot(popular_artist, aes(x=fct_reorder(track_artist,no_of_tracks), y=no_of_tracks)) +
  geom_bar(stat='identity') +
  labs(y="Number of songs", x="Artist Name") +
  ggtitle("Artists with the most tracks on Spotify") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
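The count above is a count of rows. If a track is listed in more than one playlist, it contributes one row per playlist, so rows per artist can overstate the number of distinct songs. A minimal base R sketch of the difference, using a hypothetical `toy` data frame in which artist A has the same song listed twice:

```r
# Hypothetical data: song "s1" by artist A appears in two playlists
toy <- data.frame(
  track_artist = c("A", "A", "A", "B", "B", "C"),
  track_name   = c("s1", "s1", "s2", "s3", "s4", "s5")
)

# Rows per artist (what n() counts)
rows_per_artist <- table(toy$track_artist)

# Distinct artist/track pairs, then tracks per artist
unique_tracks     <- unique(toy[, c("track_artist", "track_name")])
tracks_per_artist <- sort(table(unique_tracks$track_artist), decreasing = TRUE)

print(rows_per_artist)
print(tracks_per_artist)
```

Whether this matters for the Spotify data depends on how many tracks repeat across playlists, which is not checked in this analysis.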
The boxplot below shows how track popularity varies with playlist genre.
boxplot(track_popularity~playlist_genre, data = spotify.data, main = "Popularity vs Genre")
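A table of medians gives the numbers behind the boxes and makes the genre ranking explicit. The sketch below uses a hypothetical `toy` data frame with made-up popularity values:

```r
# Hypothetical data, not the Spotify values
toy <- data.frame(
  playlist_genre   = c("pop", "pop", "pop", "rock", "rock", "edm", "edm", "edm"),
  track_popularity = c(70, 60, 65, 50, 40, 30, 45, 35)
)

# Median popularity per genre, sorted from highest to lowest
genre_summary <- aggregate(track_popularity ~ playlist_genre, data = toy, FUN = median)
genre_summary <- genre_summary[order(-genre_summary$track_popularity), ]

print(genre_summary)
```

The same two lines with `data = spotify.data` would produce the ranking for the real genres.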
Popularity doesn’t seem to vary drastically with the ‘Key’ of a track
boxplot(track_popularity~key, data = spotify.data, main = "Popularity vs Key")
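The visual impression from the boxplot can be complemented with a formal test. Popularity is a bounded 0 to 100 score, so a rank-based Kruskal-Wallis test is a reasonable choice; this is a suggested check, not part of the original analysis. The sketch runs it on simulated data in which popularity is unrelated to key by construction:

```r
set.seed(7)

# Simulated data: three keys, popularity drawn independently of key
toy <- data.frame(
  key              = factor(rep(0:2, each = 50)),
  track_popularity = round(runif(150, min = 0, max = 100))
)

kw <- kruskal.test(track_popularity ~ key, data = toy)
print(kw$p.value)
```

On the full dataset, keep in mind that with roughly 32,800 rows even very small differences between keys can produce a small p-value, so the size of the differences in the boxplot still matters.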
model_lm <- lm(track_popularity~danceability+ energy + loudness + speechiness + acousticness + liveness + valence + duration_ms, data = spotify.data)
summary(model_lm)
##
## Call:
## lm(formula = track_popularity ~ danceability + energy + loudness +
## speechiness + acousticness + liveness + valence + duration_ms,
## data = spotify.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.432 -17.749 2.716 18.938 71.620
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.621e+01 1.479e+00 58.296 < 2e-16 ***
## danceability 1.692e+00 1.026e+00 1.648 0.09932 .
## energy -3.431e+01 1.172e+00 -29.275 < 2e-16 ***
## loudness 1.835e+00 6.236e-02 29.427 < 2e-16 ***
## speechiness -4.334e+00 1.350e+00 -3.210 0.00133 **
## acousticness 2.360e+00 7.337e-01 3.217 0.00130 **
## liveness -4.203e+00 8.864e-01 -4.742 2.13e-06 ***
## valence 5.697e+00 6.270e-01 9.085 < 2e-16 ***
## duration_ms -4.688e-05 2.288e-06 -20.490 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.21 on 32819 degrees of freedom
## Multiple R-squared: 0.06135, Adjusted R-squared: 0.06112
## F-statistic: 268.1 on 8 and 32819 DF, p-value: < 2.2e-16
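As expected from the weak correlation coefficients in the earlier correlation plot, none of the variables has a strong linear relationship with popularity, and the relationships do not look linear. Most coefficients are statistically significant, which is unsurprising with more than 32,000 observations, but the model’s R squared is only about 6%, so it explains very little of the variation in popularity. Danceability is the only predictor that is not significant at the 5% level (p ≈ 0.099).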
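The summary above combines very small p-values with an R squared of about 6%. These two facts are compatible: with tens of thousands of rows, even a weak effect is estimated precisely enough to be significant. The simulation below illustrates this with made-up data (`x`, `y` and the effect size 0.1 are assumptions), and then rescales the `duration_ms` coefficient from the output above into a more readable unit:

```r
set.seed(123)
n <- 30000                # roughly the size of the Spotify data
x <- rnorm(n)
y <- 0.1 * x + rnorm(n)   # weak true effect, lots of noise

fit       <- summary(lm(y ~ x))
p_value   <- fit$coefficients["x", "Pr(>|t|)"]
r_squared <- fit$r.squared
print(c(p_value = p_value, r_squared = r_squared))

# The duration_ms estimate is per millisecond; convert it to per minute
per_minute <- -4.688e-05 * 60000
print(per_minute)
```

The conversion shows that the model associates each additional minute of duration with a drop of about 2.8 popularity points, holding the other predictors fixed.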
Cleaned the data provided by Spotify to analyze the songs.
Performed Exploratory Data Analysis to identify top trends for the songs in the dataset.
Checked how correlated a song’s popularity is with its numerical features such as danceability, acousticness etc.
Checked for outliers, dropped the variables that are not relevant to the problem, checked for NAs and handled them accordingly, removed a few rows to remove unwanted noise from the data
Plotted box plots and histograms to identify the distribution of each variable. Checked the number of songs by artist
Plotted the pairwise correlations between variables using a correlation matrix to discover variables that are highly correlated
The metrics danceability, energy and valence are fairly normally distributed, unlike the other numerical variables
The key insight from this analysis is that, among all the variables, track_popularity has its relatively strongest correlations with duration, instrumentalness and energy. The regression above estimates negative coefficients for duration_ms and energy, which suggests that shorter, lower-energy songs with more vocals tend to be somewhat more popular than other songs
However, these correlations are weak, making it difficult to predict the popularity of any new song.
The following three findings add to the earlier key insight:
- The artist Martin Garrix has the largest number of tracks
- Pop songs seem to have relatively higher popularity than tracks from other genres
- The popularity of songs does not vary much with the ‘Key’ of a track, which is counterintuitive considering the importance given to the ‘Key’ while composing a track.
The analysis suggests that composers and production houses could pay attention to properties like instrumentalness and energy while composing a track, as these show the strongest association with popularity
Also, the ‘Key’ of a track may not be an important factor while composing a track
Although a few key metrics have been identified to predict the popularity of a track, the model explains only about 6% of the variation in popularity. So, the predicted popularities may not be very accurate
This dataset comes from a single source, Spotify. Adding data from other sources could help us build better models and predict popularity more accurately
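One way to quantify how accurate the predictions are is to hold out part of the data and compare the model’s error with that of a naive baseline that always predicts the mean. The sketch below shows the idea on simulated data; the data frame `toy`, its coefficients and the 80/20 split are assumptions made for illustration, not results from the Spotify data:

```r
set.seed(2020)
n <- 1000

# Simulated features and a noisy outcome (hypothetical, not the Spotify data)
toy <- data.frame(energy = runif(n), valence = runif(n))
toy$popularity <- 50 - 10 * toy$energy + 5 * toy$valence + rnorm(n, sd = 20)

# 80/20 train/test split
train_idx <- sample(n, size = 0.8 * n)
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

fit  <- lm(popularity ~ energy + valence, data = train)
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

model_rmse    <- rmse(test$popularity, predict(fit, newdata = test))
baseline_rmse <- rmse(test$popularity, mean(train$popularity))  # always predict the training mean

print(c(model = model_rmse, baseline = baseline_rmse))
```

If the model’s RMSE on held-out rows is barely below the baseline, the features add little predictive value, which is what an R squared of about 6% would lead us to expect.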