In this project, I analyse some of the audio features of the songs available on Spotify in order to gain insight into the factors that can contribute, directly or indirectly, to the popularity of a song. The data was obtained from https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-21/readme.md and includes features connected to each track, such as artist and genre details, as well as information about the musicality of the song, such as valence, loudness and energy.
The problem that I am trying to address with this project is to find and analyse the features which directly contribute to the popularity of a song. Many different kinds of songs are released every day. What makes a song more likeable than another? Why is one genre more popular than another? These are some of the questions I plan on addressing through this project.
At the initial stage of the project, I plan on performing some exploratory data analysis using visualizations to find answers to those questions.
This analysis might help artists come up with songs tailored to meet the expectations of their audience, and also get better reach for their songs.
These are the packages that I have used for this analysis so far:
library(tibble)
library(DT)
## Warning: package 'DT' was built under R version 3.6.2
library(knitr)
library(tm)
## Warning: package 'tm' was built under R version 3.6.2
## Loading required package: NLP
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 3.6.2
## Loading required package: RColorBrewer
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(GGally)
## Warning: package 'GGally' was built under R version 3.6.2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
library(highcharter)
## Warning: package 'highcharter' was built under R version 3.6.3
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
The following steps were performed to prepare the data and carry out an initial analysis.
spotify <- read.csv("C:/Users/leksh/Documents/SS2020/R/Week4/spotify_songs.csv", stringsAsFactors = FALSE)
head(spotify)
## track_id track_name track_artist track_popularity
## 1 2XU0oxnq2qxCpomAAuJY8K Dance Monkey Tones and I 100
## 2 2XU0oxnq2qxCpomAAuJY8K Dance Monkey Tones and I 100
## 3 696DnlkuDOXcMAnKlTgXXK ROXANNE Arizona Zervas 99
## 4 696DnlkuDOXcMAnKlTgXXK ROXANNE Arizona Zervas 99
## 5 696DnlkuDOXcMAnKlTgXXK ROXANNE Arizona Zervas 99
## 6 696DnlkuDOXcMAnKlTgXXK ROXANNE Arizona Zervas 99
## track_album_id track_album_name
## 1 0UywfDKYlyiu1b38DRrzYD Dance Monkey (Stripped Back) / Dance Monkey
## 2 0UywfDKYlyiu1b38DRrzYD Dance Monkey (Stripped Back) / Dance Monkey
## 3 6HJDrXs0hpebaRFKA1sF90 ROXANNE
## 4 6HJDrXs0hpebaRFKA1sF90 ROXANNE
## 5 6HJDrXs0hpebaRFKA1sF90 ROXANNE
## 6 6HJDrXs0hpebaRFKA1sF90 ROXANNE
## track_album_release_date
## 1 10/17/2019
## 2 10/17/2019
## 3 10/10/2019
## 4 10/10/2019
## 5 10/10/2019
## 6 10/10/2019
## playlist_name
## 1 post-teen alternative, indie, pop (large variety)
## 2 Global Top 50 | 2020 Hits
## 3 Global Top 50 | 2020 Hits
## 4 Contemporary Urban
## 5 Charts 2020 🔥Top 2020🔥Hits 2020🔥Summer 2020🔥Pop 2020🔥Popular Music🔥Clean Pop 2020🔥Sing Alongs
## 6 Charts 2020 🔥Top 2020🔥Hits 2020🔥Summer 2020🔥Pop 2020🔥Popular Music🔥Clean Pop 2020🔥Sing Alongs
## playlist_id playlist_genre playlist_subgenre danceability energy
## 1 1y42gwI5cuwjBslPyQNfqb pop post-teen pop 0.824 0.588
## 2 1KNl4AYfgZtOVm9KHkhPTF latin latin hip hop 0.824 0.588
## 3 1KNl4AYfgZtOVm9KHkhPTF latin latin hip hop 0.621 0.601
## 4 6wyJ4bsjZaUKa9f6GeZlAO r&b urban contemporary 0.621 0.601
## 5 3xMQTDLOIGvj3lWH5e5x6F r&b hip pop 0.621 0.601
## 6 3xMQTDLOIGvj3lWH5e5x6F edm pop edm 0.621 0.601
## key loudness mode speechiness acousticness instrumentalness liveness valence
## 1 6 -6.400 0 0.0924 0.6920 0.000104 0.149 0.513
## 2 6 -6.400 0 0.0924 0.6920 0.000104 0.149 0.513
## 3 6 -5.616 0 0.1480 0.0522 0.000000 0.460 0.457
## 4 6 -5.616 0 0.1480 0.0522 0.000000 0.460 0.457
## 5 6 -5.616 0 0.1480 0.0522 0.000000 0.460 0.457
## 6 6 -5.616 0 0.1480 0.0522 0.000000 0.460 0.457
## tempo duration_ms
## 1 98.027 209438
## 2 98.027 209438
## 3 116.735 163636
## 4 116.735 163636
## 5 116.735 163636
## 6 116.735 163636
colnames(spotify)
## [1] "track_id" "track_name"
## [3] "track_artist" "track_popularity"
## [5] "track_album_id" "track_album_name"
## [7] "track_album_release_date" "playlist_name"
## [9] "playlist_id" "playlist_genre"
## [11] "playlist_subgenre" "danceability"
## [13] "energy" "key"
## [15] "loudness" "mode"
## [17] "speechiness" "acousticness"
## [19] "instrumentalness" "liveness"
## [21] "valence" "tempo"
## [23] "duration_ms"
dim(spotify)
## [1] 32833 23
We observe that there are 32833 rows and 23 columns in the dataset.
str(spotify)
## 'data.frame': 32833 obs. of 23 variables:
## $ track_id : chr "2XU0oxnq2qxCpomAAuJY8K" "2XU0oxnq2qxCpomAAuJY8K" "696DnlkuDOXcMAnKlTgXXK" "696DnlkuDOXcMAnKlTgXXK" ...
## $ track_name : chr "Dance Monkey" "Dance Monkey" "ROXANNE" "ROXANNE" ...
## $ track_artist : chr "Tones and I" "Tones and I" "Arizona Zervas" "Arizona Zervas" ...
## $ track_popularity : int 100 100 99 99 99 99 98 98 98 98 ...
## $ track_album_id : chr "0UywfDKYlyiu1b38DRrzYD" "0UywfDKYlyiu1b38DRrzYD" "6HJDrXs0hpebaRFKA1sF90" "6HJDrXs0hpebaRFKA1sF90" ...
## $ track_album_name : chr "Dance Monkey (Stripped Back) / Dance Monkey" "Dance Monkey (Stripped Back) / Dance Monkey" "ROXANNE" "ROXANNE" ...
## $ track_album_release_date: chr "10/17/2019" "10/17/2019" "10/10/2019" "10/10/2019" ...
## $ playlist_name : chr "post-teen alternative, indie, pop (large variety)" "Global Top 50 | 2020 Hits" "Global Top 50 | 2020 Hits" "Contemporary Urban" ...
## $ playlist_id : chr "1y42gwI5cuwjBslPyQNfqb" "1KNl4AYfgZtOVm9KHkhPTF" "1KNl4AYfgZtOVm9KHkhPTF" "6wyJ4bsjZaUKa9f6GeZlAO" ...
## $ playlist_genre : chr "pop" "latin" "latin" "r&b" ...
## $ playlist_subgenre : chr "post-teen pop" "latin hip hop" "latin hip hop" "urban contemporary" ...
## $ danceability : num 0.824 0.824 0.621 0.621 0.621 0.621 0.803 0.764 0.513 0.764 ...
## $ energy : num 0.588 0.588 0.601 0.601 0.601 0.601 0.715 0.32 0.796 0.32 ...
## $ key : int 6 6 6 6 6 6 2 11 1 11 ...
## $ loudness : num -6.4 -6.4 -5.62 -5.62 -5.62 ...
## $ mode : int 0 0 0 0 0 0 1 1 1 1 ...
## $ speechiness : num 0.0924 0.0924 0.148 0.148 0.148 0.148 0.298 0.0546 0.0629 0.0546 ...
## $ acousticness : num 0.692 0.692 0.0522 0.0522 0.0522 0.0522 0.295 0.837 0.00147 0.837 ...
## $ instrumentalness : num 0.000104 0.000104 0 0 0 0 0.000134 0 0.000209 0 ...
## $ liveness : num 0.149 0.149 0.46 0.46 0.46 0.46 0.0574 0.0822 0.0938 0.0822 ...
## $ valence : num 0.513 0.513 0.457 0.457 0.457 0.457 0.574 0.575 0.345 0.575 ...
## $ tempo : num 98 98 117 117 117 ...
## $ duration_ms : int 209438 209438 163636 163636 163636 163636 200960 189486 201573 189486 ...
table(is.na(spotify))
##
## FALSE TRUE
## 755144 15
There are some observations with missing values.
sapply(spotify, function(x) sum(is.na(x)))
## track_id track_name track_artist
## 0 5 5
## track_popularity track_album_id track_album_name
## 0 0 5
## track_album_release_date playlist_name playlist_id
## 0 0 0
## playlist_genre playlist_subgenre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
We find that the columns with missing values are the track name, artist name and album name columns.
Let us look at the observations which contain these missing values.
spotify_null <- spotify[!complete.cases(spotify),]
spotify_null
## track_id track_name track_artist track_popularity
## 30558 69gRFGOWY9OMpFJgFol1u0 <NA> <NA> 0
## 30759 5cjecvX0CmC9gK0Laf5EMQ <NA> <NA> 0
## 30760 5TTzhRSWQS4Yu8xTgAuq6D <NA> <NA> 0
## 31405 3VKFip3OdAvv4OfNTgFWeQ <NA> <NA> 0
## 31428 69gRFGOWY9OMpFJgFol1u0 <NA> <NA> 0
## track_album_id track_album_name track_album_release_date
## 30558 717UG2du6utFe7CdmpuUe3 <NA> 1/5/2012
## 30759 3luHJEPw434tvNbme3SP8M <NA> 12/1/2017
## 30760 3luHJEPw434tvNbme3SP8M <NA> 12/1/2017
## 31405 717UG2du6utFe7CdmpuUe3 <NA> 1/5/2012
## 31428 717UG2du6utFe7CdmpuUe3 <NA> 1/5/2012
## playlist_name playlist_id playlist_genre
## 30558 HIP&HOP 5DyJsJZOpMJh34WvUrQzMV rap
## 30759 GANGSTA Rap 5GA8GDo7RQC3JEanT81B3g rap
## 30760 GANGSTA Rap 5GA8GDo7RQC3JEanT81B3g rap
## 31405 Reggaeton viejito🔥 0si5tw70PIgPkY1Eva6V8f latin
## 31428 latin hip hop 3nH8aytdqNeRbcRCg3dw9q latin
## playlist_subgenre danceability energy key loudness mode speechiness
## 30558 southern hip hop 0.714 0.821 6 -7.635 1 0.1760
## 30759 gangster rap 0.678 0.659 11 -5.364 0 0.3190
## 30760 gangster rap 0.465 0.820 10 -5.907 0 0.3070
## 31405 reggaeton 0.675 0.919 11 -6.075 0 0.0366
## 31428 latin hip hop 0.714 0.821 6 -7.635 1 0.1760
## acousticness instrumentalness liveness valence tempo duration_ms
## 30558 0.0410 0.00000 0.1160 0.649 95.999 282707
## 30759 0.0534 0.00000 0.5530 0.191 146.153 202235
## 30760 0.0963 0.00000 0.0888 0.505 86.839 206465
## 31405 0.0606 0.00653 0.1030 0.726 97.017 252773
## 31428 0.0410 0.00000 0.1160 0.649 95.999 282707
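The missing-value checks above can be tried out on a small example. The sketch below uses a made-up data frame (`toy`, with invented values, not taken from the Spotify data) and shows that `colSums(is.na(...))` gives the same kind of per-column count as the `sapply` call, while `complete.cases` picks out the affected rows.

```r
# Made-up data frame standing in for the Spotify data (values are invented).
toy <- data.frame(
  track_id   = c("id1", "id2", "id3"),
  track_name = c("Song A", NA, "Song C"),
  energy     = c(0.5, 0.8, 0.7),
  stringsAsFactors = FALSE
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Rows that contain at least one missing value.
incomplete_rows <- toy[!complete.cases(toy), ]
print(incomplete_rows)
```

Because the audio features in these rows are complete, plots and models that use only the numeric columns are not affected by the missing names.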
Since only the track, artist and album names are missing, I am not going to treat these observations and will keep them as they are for my analysis.
I am mostly going to use visualizations to find insights into the data. Visualizations help us find anomalies and general trends in the data and also make it much easier to come up with insights.
Song genre - which is the most popular genre on Spotify?
genre <- Corpus(VectorSource(spotify$playlist_genre))
genre_dtm <- DocumentTermMatrix(genre)
genre_freq <- colSums(as.matrix(genre_dtm))
freq <- sort(colSums(as.matrix(genre_dtm)), decreasing=TRUE)
genre_wf <- data.frame(word=names(genre_freq), freq=genre_freq)
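# fill is set outside aes() so the bars are drawn in red instead of being mapped to a legend entry
ggplot(genre_wf, aes(x = reorder(word, -freq), y = freq)) +
geom_bar(stat = "identity", fill = "red") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle("Most Popular genres in Spotify") +
xlab("Genre") +
ylab("Frequency")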
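The document-term matrix above effectively counts how often each genre label occurs in `playlist_genre`. As a rough alternative, the same kind of frequency table can be built with base R's `table()`. The sketch below runs on a made-up vector of genre labels (`genres` is hypothetical), not on the real data.

```r
# Made-up genre labels standing in for spotify$playlist_genre.
genres <- c("pop", "latin", "latin", "r&b", "r&b", "edm", "edm", "edm")

# Count each label and sort from most to least frequent.
genre_counts <- sort(table(genres), decreasing = TRUE)

# Convert to a data frame with the same column roles as genre_wf (label, frequency).
genre_df <- as.data.frame(genre_counts, stringsAsFactors = FALSE)
names(genre_df) <- c("genre", "freq")
print(genre_df)
```

Note that these counts are numbers of playlist entries per genre, so a track that appears in several playlists is counted once for each appearance.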
spotify %>% select_if(is.numeric) %>% pairs()
Since the scatterplot matrix is very difficult to read, we try a correlation matrix instead.
options(repr.plot.width = 20, repr.plot.height = 15)
spotify_sliced <- spotify[sapply(spotify, is.numeric)]
corr <- cor(spotify_sliced)
library(corrplot)
## corrplot 0.84 loaded
num <- corrplot(corr, method = "number")
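To read a correlation matrix with popularity in mind, it can help to pull out the single row for `track_popularity` and sort it by absolute value. The sketch below does this on a small made-up numeric data frame (`toy`, with invented values), so the numbers it prints say nothing about the real dataset.

```r
# Made-up numeric data standing in for the numeric Spotify columns.
toy <- data.frame(
  track_popularity = c(90, 80, 70, 60, 50, 40),
  energy           = c(0.9, 0.8, 0.75, 0.6, 0.5, 0.45),
  acousticness     = c(0.1, 0.2, 0.15, 0.5, 0.6, 0.7),
  loudness         = c(-4, -5, -5.5, -7, -8, -9)
)

corr_toy <- cor(toy)

# Correlation of every other column with track_popularity.
pop_corr <- corr_toy["track_popularity", colnames(corr_toy) != "track_popularity"]

# Strongest relationships first, regardless of sign.
pop_corr_sorted <- pop_corr[order(abs(pop_corr), decreasing = TRUE)]
print(round(pop_corr_sorted, 2))
```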
boxplot(acousticness~playlist_genre, data=spotify,
main = "Variation of acousticness between genres",
xlab = "Acousticness",
ylab = "Genre",
col = "magenta",
border = "red",
horizontal = TRUE,
notch = TRUE
)
boxplot(energy~playlist_genre, data=spotify,
main = "Variation of energy between genres",
xlab = "Energy",
ylab = "Genre",
col = "orange",
border = "brown",
horizontal = TRUE,
notch = TRUE
)
The highest energy values come from songs in the edm genre.
boxplot(loudness~playlist_genre, data=spotify,
main = "Variation of loudness between genres",
xlab = "Loudness",
ylab = "Genre",
col = "green",
border = "dark green",
horizontal = TRUE,
notch = TRUE
)
Songs in the edm genre are generally louder in nature.
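The boxplot readings can be backed up with a numeric summary, for example the median of each feature per genre. The sketch below uses `aggregate()` on a made-up data frame (`toy`, invented values), so it only illustrates the approach and does not reproduce the real medians.

```r
# Made-up data standing in for the Spotify data.
toy <- data.frame(
  playlist_genre = c("edm", "edm", "edm", "pop", "pop", "pop", "r&b", "r&b", "r&b"),
  energy   = c(0.90, 0.85, 0.80, 0.70, 0.65, 0.60, 0.50, 0.55, 0.60),
  loudness = c(-4.0, -5.0, -4.5, -6.0, -6.5, -7.0, -8.0, -7.5, -8.5),
  stringsAsFactors = FALSE
)

# Median energy and loudness for each genre.
genre_medians <- aggregate(cbind(energy, loudness) ~ playlist_genre,
                           data = toy, FUN = median)

# Genres ordered from highest to lowest median energy.
genre_medians <- genre_medians[order(-genre_medians$energy), ]
print(genre_medians)
```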
We also saw that there is some correlation between the audio features. Let us see how energy, valence and danceability are distributed among the top 50 songs.
spotify_arrange <- spotify[order(-spotify$track_popularity),]
corr_deb <- ggplot(head(spotify_arrange,50)) +
geom_density(aes(energy, fill ="energy", alpha = 0.01)) +
geom_density(aes(valence, fill ="valence", alpha = 0.01)) +
geom_density(aes(danceability, fill ="danceability", alpha = 0.01)) +
scale_x_continuous(name = "Energy, Valence and danceability") +
scale_y_continuous(name = "Density") +
ggtitle("Density plot of Energy, Valence and danceability") +
theme_bw() +
theme(plot.title = element_text(size = 14, face = "bold"),
text = element_text(size = 12)) +
theme(legend.title=element_blank()) +
scale_fill_brewer(palette="Accent")
corr_deb
We can see that the three distributions overlap substantially and are left-skewed. These variables are positively correlated with one another.
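One caveat: the `head(spotify)` output shows the same track repeated once per playlist (Dance Monkey twice, ROXANNE four times), so the first 50 rows of the sorted data hold fewer than 50 distinct songs. A minimal sketch of keeping one row per `track_id` before ranking, using a made-up data frame (`toy`, invented values):

```r
# Made-up data: the same track_id appears once per playlist it belongs to.
toy <- data.frame(
  track_id         = c("a", "a", "b", "b", "b", "c"),
  track_popularity = c(100, 100, 99, 99, 99, 98),
  playlist_genre   = c("pop", "latin", "latin", "r&b", "edm", "pop"),
  stringsAsFactors = FALSE
)

# Keep only the first row seen for each track.
toy_unique <- toy[!duplicated(toy$track_id), ]

# Rank the distinct tracks by popularity and take the top two.
toy_ranked <- toy_unique[order(-toy_unique$track_popularity), ]
top_tracks <- head(toy_ranked, 2)
print(top_tracks)
```

Keeping the first row also keeps only one of the track's playlist genres, which is a simplification; whether that is acceptable depends on the question being asked.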
spotify_arrange <- spotify[order(-spotify$track_popularity),]
popularity_vote <- head(spotify_arrange,50) %>%
select(energy, track_artist, speechiness, tempo, playlist_genre) %>%
group_by(energy)%>%
filter(!is.na(energy)) %>%
filter(!is.na(track_artist))%>%
filter(!is.na(speechiness))%>%
filter(!is.na(tempo))%>%
ggplot(mapping = aes(x = track_artist, y = energy, color = tempo, alpha = speechiness, fill = playlist_genre))+
geom_bar(stat = 'identity')+
coord_polar()+
theme_minimal()
popularity_vote
spotify_arrange <- spotify[order(-spotify$track_popularity),]
#library(plotly)
plot_ly(head(spotify_arrange, n=500), x = ~track_popularity, y = ~energy, z = ~valence,
color = ~playlist_genre, colors = c('magenta', 'green', 'red', 'cyan', 'orange', 'yellow'),size = I(100)) %>%
add_markers() %>%
layout(scene = list(xaxis = list(title = 'track_popularity'),
yaxis = list(title = 'Energy'),
zaxis = list(title = 'Valence')),
title = "3D Scatter plot: Track_popularity vs Energy vs Valence",
showlegend = FALSE)
We can see that the popularity of a song is directly related to its energy.
We examined all the above factors and saw the correlation between various factors and how each factor affects popularity. Now, from a modeling perspective, I want to find the factors which most affect popularity and differentiate one song from another, making it more or less popular than the other.
library(rpart)
## Warning: package 'rpart' was built under R version 3.6.2
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.6.2
tree_model <- rpart(track_popularity ~ loudness+valence+energy+key+danceability+acousticness+playlist_genre, data = spotify_arrange)
rpart.plot(tree_model, box.palette = "GnBu")
We see that, out of the factors loudness, acousticness, playlist_genre, energy and valence, genre plays a major role, as 35% of the top songs are edm. Loudness plays a role as well: songs with loudness > -7.4 have a good chance of being among the top songs.
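Besides reading the plotted tree, the relative role of each predictor can be inspected through the `variable.importance` component of a fitted `rpart` object. The sketch below uses R's built-in `mtcars` data purely as a stand-in (it is not the Spotify data, and the chosen predictors are arbitrary), so only the approach carries over.

```r
library(rpart)

# Stand-in regression tree on the built-in mtcars data (not the Spotify data).
fit <- rpart(mpg ~ cyl + disp + hp + wt, data = mtcars)

# Named numeric vector: larger values mean the variable contributed more to the splits.
importance <- fit$variable.importance

# Express each variable's importance as a percentage of the total.
importance_pct <- round(100 * importance / sum(importance), 1)
print(importance_pct)
```

Applied to `tree_model`, the same component would list the importance of loudness, valence, energy and the other predictors used in the formula.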
These are the findings that I have observed from the analysis thus far on the spotify dataset.