Presented by - Daya Nayak, Shubham Jha, Maulik Patel
Spotify is a Sweden-based audio streaming and media services provider that launched in October 2008. It is now one of the biggest digital music, podcast, and streaming services in the world, giving access to millions of songs from artists all over the world. It has also started producing music albums and events across the world.
Spotify offers over 70 million tracks, and 2 million podcasts to more than 300 million monthly users. As a freemium service, basic features are free with advertisements and limited control, while additional features, such as offline listening and commercial-free listening, are offered via paid subscriptions. Users can search for music based on track, artist, album, or genre, and can create, edit & share playlists.
The scope of the project is to determine whether the popularity of a song depends on its genre/sub-genre and on audio features such as loudness, speechiness and danceability. This analysis can provide insights into which features make a song popular and can also help in recommending music to users. We would also like to find groups among the songs based on their features. This would in turn help users create playlists by suggesting songs similar to a selected song.
We plan to explore the following aspects in the data analysis:
What do we hope to achieve with the analysis?
These insights will help us build a model that predicts the popularity score of a song from its features. Such a model is helpful when composing music, as it shows which factors weigh heavily on popularity and which ones artists should focus on to maximise their chance of success.
Build a simple recommender-system that suggests similar songs based on the user’s preferences and listening habits. Helpful in discovering new songs and playlist creation.
The following packages are used in the analysis:
tidyverse - Collection of R packages for data manipulation, exploration and visualization.
ggplot2 - A system for declaratively creating graphics, based on The Grammar of Graphics; used for plotting charts.
plotly - Interactive, web-based charts via the open-source JavaScript graphing library plotly.js.
factoextra - To visualize the output of multivariate data analysis.
funModeling - Toolbox for exploratory data analysis and data preparation.
RColorBrewer - To help choose sensible colour schemes for figures in R.
lubridate - Eases working with date and time data types.
knitr - Enables integration of R code into LaTeX, LyX, HTML, Markdown, AsciiDoc, and reStructuredText documents.
DT - Renders R data objects as interactive HTML tables.
cowplot - Provides additional functionality for ggplot2.
wordcloud - Creates word clouds.
corrplot - Visualizes the correlation matrix, used to find collinearity between features.
kableExtra - Allows users to construct complex tables and customize styles using a readable syntax.
imager - Allows loading images from a publicly available URL.
The data comes from Spotify via the spotifyr package. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get either your own data or general metadata around songs from Spotify’s API.
A subset of the data had already been extracted and is available on GitHub; the analysis has been done on this subset. The song database contains songs, their popularity, their artists and the albums they belong to, drawn from 6 main genres (EDM, Latin, Pop, R&B, Rap, and Rock) and released between Jan 1957 and Jan 2020.
Reading the Data from the source file
spotify_songs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
The dataset comprises 32,833 songs (with 23 features) on Spotify from 1957-2020 and was updated on 1/21/2020. This Medium article does a great job of explaining the various “audio features” that Spotify links to a song.
Here is the data dictionary as a reference:
### Checking dimension of Data
dim(spotify_songs)
## [1] 32833 23
### Checking structure of Data
str(spotify_songs)
## spec_tbl_df [32,833 x 23] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ track_id : chr [1:32833] "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
## $ track_name : chr [1:32833] "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : chr [1:32833] "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_popularity : num [1:32833] 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : chr [1:32833] "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
## $ track_album_name : chr [1:32833] "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
## $ track_album_release_date: chr [1:32833] "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
## $ playlist_name : chr [1:32833] "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
## $ playlist_id : chr [1:32833] "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
## $ playlist_genre : chr [1:32833] "pop" "pop" "pop" "pop" ...
## $ playlist_subgenre : chr [1:32833] "dance pop" "dance pop" "dance pop" "dance pop" ...
## $ danceability : num [1:32833] 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num [1:32833] 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : num [1:32833] 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num [1:32833] -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : num [1:32833] 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num [1:32833] 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num [1:32833] 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num [1:32833] 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num [1:32833] 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num [1:32833] 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num [1:32833] 122 100 124 122 124 ...
## $ duration_ms : num [1:32833] 194754 162600 176616 169093 189052 ...
## - attr(*, "spec")=
## .. cols(
## .. track_id = col_character(),
## .. track_name = col_character(),
## .. track_artist = col_character(),
## .. track_popularity = col_double(),
## .. track_album_id = col_character(),
## .. track_album_name = col_character(),
## .. track_album_release_date = col_character(),
## .. playlist_name = col_character(),
## .. playlist_id = col_character(),
## .. playlist_genre = col_character(),
## .. playlist_subgenre = col_character(),
## .. danceability = col_double(),
## .. energy = col_double(),
## .. key = col_double(),
## .. loudness = col_double(),
## .. mode = col_double(),
## .. speechiness = col_double(),
## .. acousticness = col_double(),
## .. instrumentalness = col_double(),
## .. liveness = col_double(),
## .. valence = col_double(),
## .. tempo = col_double(),
## .. duration_ms = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
### summarising the Data set and features
summary(spotify_songs)
## track_id track_name track_artist track_popularity
## Length:32833 Length:32833 Length:32833 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 24.00
## Mode :character Mode :character Mode :character Median : 45.00
## Mean : 42.48
## 3rd Qu.: 62.00
## Max. :100.00
## track_album_id track_album_name track_album_release_date
## Length:32833 Length:32833 Length:32833
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## playlist_name playlist_id playlist_genre playlist_subgenre
## Length:32833 Length:32833 Length:32833 Length:32833
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## danceability energy key loudness
## Min. :0.0000 Min. :0.000175 Min. : 0.000 Min. :-46.448
## 1st Qu.:0.5630 1st Qu.:0.581000 1st Qu.: 2.000 1st Qu.: -8.171
## Median :0.6720 Median :0.721000 Median : 6.000 Median : -6.166
## Mean :0.6548 Mean :0.698619 Mean : 5.374 Mean : -6.720
## 3rd Qu.:0.7610 3rd Qu.:0.840000 3rd Qu.: 9.000 3rd Qu.: -4.645
## Max. :0.9830 Max. :1.000000 Max. :11.000 Max. : 1.275
## mode speechiness acousticness instrumentalness
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000000
## 1st Qu.:0.0000 1st Qu.:0.0410 1st Qu.:0.0151 1st Qu.:0.0000000
## Median :1.0000 Median :0.0625 Median :0.0804 Median :0.0000161
## Mean :0.5657 Mean :0.1071 Mean :0.1753 Mean :0.0847472
## 3rd Qu.:1.0000 3rd Qu.:0.1320 3rd Qu.:0.2550 3rd Qu.:0.0048300
## Max. :1.0000 Max. :0.9180 Max. :0.9940 Max. :0.9940000
## liveness valence tempo duration_ms
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 4000
## 1st Qu.:0.0927 1st Qu.:0.3310 1st Qu.: 99.96 1st Qu.:187819
## Median :0.1270 Median :0.5120 Median :121.98 Median :216000
## Mean :0.1902 Mean :0.5106 Mean :120.88 Mean :225800
## 3rd Qu.:0.2480 3rd Qu.:0.6930 3rd Qu.:133.92 3rd Qu.:253585
## Max. :0.9960 Max. :0.9910 Max. :239.44 Max. :517810
### checking for NULLs or missing values
colSums(is.na(spotify_songs))
## track_id track_name track_artist
## 0 5 5
## track_popularity track_album_id track_album_name
## 0 0 5
## track_album_release_date playlist_name playlist_id
## 0 0 0
## playlist_genre playlist_subgenre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
Null values: The track_name, track_album_name and track_artist variables each contain 5 missing values. Out of 32,833 rows, we can remove these 5 rows without any significant impact on our data.
Duplicate Data: Some songs appear more than once in this dataset. Out of 32,833 rows, only 28,352 songs are unique. The repeated rows have the same ‘track_id’ but a different ‘playlist_id’, so we need to remove them. Since ‘track_id’ is the unique identifier of a song and its other numeric and categorical features remain the same, we will delete the duplicated songs based on ‘track_id’.
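The assumption that repeated rows differ only in their playlist columns can be checked before dropping them. The following is a minimal base-R sketch on a toy data frame; the `toy` data and the `feature_cols` list are illustrative assumptions, not part of the original analysis.

```r
# Toy stand-in for spotify_songs: track "a" appears in two playlists
toy <- data.frame(
  track_id     = c("a", "a", "b", "c"),
  playlist_id  = c("p1", "p2", "p1", "p3"),
  danceability = c(0.7, 0.7, 0.5, 0.9),
  energy       = c(0.8, 0.8, 0.4, 0.6),
  stringsAsFactors = FALSE
)

# Columns expected to be identical across repeats of a track (assumed list)
feature_cols <- c("danceability", "energy")

# Number of distinct songs
n_unique <- length(unique(toy$track_id))

# After dropping exact repeats of (track_id + features), any track_id that
# still appears more than once has conflicting feature values
distinct_rows <- unique(toy[, c("track_id", feature_cols)])
conflicting   <- distinct_rows$track_id[duplicated(distinct_rows$track_id)]

# Keep the first occurrence of each track, as done in the analysis
toy_dedup <- toy[!duplicated(toy$track_id), ]
```

If `conflicting` is empty, removing duplicates by `track_id` alone loses no feature information.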
#### Removing NULL values from the data
spotify_songs <- na.omit(spotify_songs)
#### Changing datatype of some categorical columns from string to factor.
#### These columns take only a few distinct values (unlike free text such as song names),
#### and factors are the standard type for categorical data analysis
spotify_songs <- spotify_songs %>%
mutate(playlist_genre = as.factor(playlist_genre),
playlist_subgenre = as.factor(playlist_subgenre),
mode = as.factor(mode),
key = as.factor(key))
#### removing duplicated data
spotify_songs <- spotify_songs[!duplicated(spotify_songs$track_id),]
dim(spotify_songs)
## [1] 28352 23
### summarising the Data set and features
summary(spotify_songs)
## track_id track_name track_artist track_popularity
## Length:28352 Length:28352 Length:28352 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 21.00
## Mode :character Mode :character Mode :character Median : 42.00
## Mean : 39.34
## 3rd Qu.: 58.00
## Max. :100.00
##
## track_album_id track_album_name track_album_release_date
## Length:28352 Length:28352 Length:28352
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## playlist_name playlist_id playlist_genre
## Length:28352 Length:28352 edm :4877
## Class :character Class :character latin:4136
## Mode :character Mode :character pop :5132
## r&b :4504
## rap :5398
## rock :4305
##
## playlist_subgenre danceability energy
## southern hip hop : 1582 Min. :0.0000 Min. :0.000175
## indie poptimism : 1547 1st Qu.:0.5610 1st Qu.:0.579000
## neo soul : 1478 Median :0.6700 Median :0.722000
## progressive electro house: 1460 Mean :0.6534 Mean :0.698372
## electro house : 1416 3rd Qu.:0.7600 3rd Qu.:0.843000
## gangster rap : 1314 Max. :0.9830 Max. :1.000000
## (Other) :19555
## key loudness mode speechiness acousticness
## 1 : 3436 Min. :-46.448 0:12318 Min. :0.0000 Min. :0.0000
## 0 : 3001 1st Qu.: -8.310 1:16034 1st Qu.:0.0410 1st Qu.:0.0143
## 7 : 2907 Median : -6.261 Median :0.0626 Median :0.0797
## 9 : 2631 Mean : -6.818 Mean :0.1079 Mean :0.1772
## 11 : 2577 3rd Qu.: -4.709 3rd Qu.:0.1330 3rd Qu.:0.2600
## 2 : 2478 Max. : 1.275 Max. :0.9180 Max. :0.9940
## (Other):11322
## instrumentalness liveness valence tempo
## Min. :0.0000000 Min. :0.0000 Min. :0.0000 Min. : 0.00
## 1st Qu.:0.0000000 1st Qu.:0.0926 1st Qu.:0.3290 1st Qu.: 99.97
## Median :0.0000207 Median :0.1270 Median :0.5120 Median :121.99
## Mean :0.0911294 Mean :0.1910 Mean :0.5104 Mean :120.96
## 3rd Qu.:0.0065725 3rd Qu.:0.2490 3rd Qu.:0.6950 3rd Qu.:134.00
## Max. :0.9940000 Max. :0.9960 Max. :0.9910 Max. :239.44
##
## duration_ms
## Min. : 4000
## 1st Qu.:187741
## Median :216933
## Mean :226575
## 3rd Qu.:254975
## Max. :517810
##
Redundant Columns : Now since we don’t have duplicate records, and we would like to analyze which features influence the ‘track_popularity’, we can drop the following columns which are not useful in our analyses:
#### Dropping Redundant Columns
spotify_songs <- spotify_songs %>% select(-c(track_id, track_album_id,
track_album_name,
playlist_id, playlist_name,
playlist_subgenre))
Data Manipulation :
#### The release date is stored as year-first text (e.g. "2019-06-14"),
#### so its first four characters give the release year
spotify_songs$track_album_release_date <- as.character(spotify_songs$track_album_release_date)
spotify_songs$year <- substr(spotify_songs$track_album_release_date,1,4)
#### changing data type of year column
spotify_songs$year <- as.numeric(spotify_songs$year)
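Taking the first four characters works for any year-first date string, including partial dates. The short sketch below uses made-up values; the partial formats in the last two entries are assumed for illustration, since the output shown here only contains full `YYYY-MM-DD` dates.

```r
# Example release dates; the last two are hypothetical partial dates
release_date <- c("2019-06-14", "1975-10-31", "2012-05", "1998")

# The first four characters are the year in every year-first format
year <- as.numeric(substr(release_date, 1, 4))

# Decade as used later in the analysis: 2019 -> 2010, 1975 -> 1970
decade <- year - (year %% 10)
```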
### Checking structure of Data
str(spotify_songs)
## tibble [28,352 x 18] (S3: tbl_df/tbl/data.frame)
## $ track_name : chr [1:28352] "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : chr [1:28352] "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_popularity : num [1:28352] 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_release_date: chr [1:28352] "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
## $ playlist_genre : Factor w/ 6 levels "edm","latin",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ danceability : num [1:28352] 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num [1:28352] 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : Factor w/ 12 levels "0","1","2","3",..: 7 12 2 8 2 9 6 5 9 3 ...
## $ loudness : num [1:28352] -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : Factor w/ 2 levels "0","1": 2 2 1 2 2 2 1 1 2 2 ...
## $ speechiness : num [1:28352] 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num [1:28352] 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num [1:28352] 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num [1:28352] 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num [1:28352] 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num [1:28352] 122 100 124 122 124 ...
## $ duration_ms : num [1:28352] 194754 162600 176616 169093 189052 ...
## $ year : num [1:28352] 2019 2019 2019 2019 2019 ...
## - attr(*, "na.action")= 'omit' Named int [1:5] 8152 9283 9284 19569 19812
## ..- attr(*, "names")= chr [1:5] "8152" "9283" "9284" "19569" ...
A preview of the clean data-set is given below:
### displaying top 100 rows
output_data <- head(spotify_songs, n = 100)
datatable(output_data, filter = 'top', options = list(pageLength = 25))
Let’s see the distribution of songs across different genres. Which genre has the largest number of songs in the dataset?
# songs per genre
spotify_songs %>% group_by(Genre = playlist_genre) %>%
summarise(No_of_tracks = n()) %>%
arrange(desc(No_of_tracks)) %>% knitr::kable()
| Genre | No_of_tracks |
|---|---|
| rap | 5398 |
| pop | 5132 |
| edm | 4877 |
| r&b | 4504 |
| rock | 4305 |
| latin | 4136 |
Rap is the genre with the most songs in the dataset, followed by Pop and then EDM.
# artists with most releases
highest_tracks <- spotify_songs %>% group_by(Artist = track_artist) %>%
summarise(No_of_tracks = n()) %>%
arrange(desc(No_of_tracks)) %>%
top_n(15, wt = No_of_tracks) %>%
ggplot(aes(x = Artist, y = No_of_tracks)) +
geom_bar(stat = "identity") +
coord_flip() + labs(title = "Artists With The Most Track Releases", x = "Artist", y = "# of Tracks")
ggplotly(highest_tracks)
Queen (130 tracks), Martin Garrix (87 tracks) and Don Omar (84 tracks) are among the artists with the most track releases across the years. The top 15 artists are shown here.
#Corpus(), tm_map() and TermDocumentMatrix() come from the tm package
library(tm)
#Create a vector containing only the text
name <- spotify_songs$track_name
# Create a corpus
corpus_ <- Corpus(VectorSource(name))
#clean text data - remove numbers, punctuation and extra whitespace
corpus_ <- corpus_ %>%
tm_map(removeNumbers) %>%
tm_map(removePunctuation) %>%
tm_map(stripWhitespace)
#lower-case the text, then remove English stop words and common title suffixes
corpus_ <- tm_map(corpus_, content_transformer(tolower))
corpus_ <- tm_map(corpus_, removeWords, stopwords("english"))
corpus_ <- tm_map(corpus_, removeWords,c("feat","edit","remix","remastered","remaster","radio","version","original","mix"))
#create a term-document matrix (one row per word)
dtm <- TermDocumentMatrix(corpus_)
dtm_matrix <- as.matrix(dtm)
words <- sort(rowSums(dtm_matrix),decreasing=TRUE)
df <- data.frame(word = names(words),freq=words)
#generate the word cloud
wordcloud(words = df$word, freq = df$freq,scale=c(8,0.25), min.freq = 1,
max.words=150, random.order=FALSE, rot.per=0.25,
colors=brewer.pal(8, "Dark2"))
“Love” is the most frequently used word in song titles, followed by “don’t” and “like”.
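Word frequencies can also be computed without materialising a dense term-document matrix, which becomes large for roughly 28,000 titles. The base-R sketch below assumes a simple whitespace tokeniser and a small hand-picked stop-word list; both are illustrative and differ from the `tm` pipeline used in the analysis.

```r
# Made-up titles for illustration
titles <- c("Love Me Like You Do", "Don't Stop Me Now - Remastered 2011",
            "Somebody To Love", "Like A Prayer")

# Hypothetical stop-word list; tm's stopwords("english") is much longer
stop_words <- c("me", "you", "do", "a", "to", "now", "remastered")

clean <- tolower(titles)
clean <- gsub("[[:digit:]]+", " ", clean)   # drop numbers
clean <- gsub("[[:punct:]]+", "", clean)    # drop punctuation ("don't" -> "dont")

# Split on whitespace and flatten into one vector of tokens
tokens <- unlist(strsplit(clean, "[[:space:]]+"))
tokens <- tokens[nzchar(tokens) & !(tokens %in% stop_words)]

# Frequency table, most common word first
word_freq <- sort(table(tokens), decreasing = TRUE)
```

The resulting `word_freq` can be passed to `wordcloud()` through `names(word_freq)` and `as.vector(word_freq)`.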
#popularity among genres
popularity_vs_genre_plot<- ggplot(spotify_songs, aes(x = playlist_genre, y =
track_popularity)) +
geom_boxplot() +
coord_flip() +
labs(title = "Popularity across genres", x = "Genres", y = "Popularity")
ggplotly(popularity_vs_genre_plot)
Based on the median values, Pop is the most popular genre, closely followed by Latin and Rap.
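The medians read off the boxplot can also be computed directly. This is a small sketch on made-up values (the `toy` data frame is an assumption); on the real data the same `tapply()` call would be applied to `spotify_songs`.

```r
# Toy data standing in for spotify_songs (values are made up)
toy <- data.frame(
  playlist_genre   = c("pop", "pop", "pop", "rock", "rock", "rock",
                       "edm", "edm", "edm"),
  track_popularity = c(60, 52, 45, 50, 41, 30, 20, 36, 55)
)

# Median popularity per genre, highest first
median_pop <- sort(tapply(toy$track_popularity, toy$playlist_genre, median),
                   decreasing = TRUE)
```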
# grouping tracks by years
tracks_year <- spotify_songs %>%
select(year) %>%
filter(year<2020) %>%
group_by(year) %>%
summarise(count = n())
#plot of tracks released across the years
tracks_vs_year <- ggplot(tracks_year,aes(x = year, y = count,group = 1)) +
geom_line() +
theme(legend.position = "none",axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title = "Release of songs across years", x = "Year",
y = "No of songs released")
ggplotly(tracks_vs_year)
We see that almost 75% of the songs were released in the 21st century. The advent of the internet and audio streaming services may have caused this drastic increase in the production of songs.
Do the same genres stay popular over the years, or has people’s taste in music changed?
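A share like this can be computed in one line from the `year` column. The sketch below uses hypothetical release years and takes the 21st century to start in 2001, which is an assumption; use 2000 if that convention is preferred.

```r
# Hypothetical release years
year <- c(1975, 1988, 1999, 2003, 2010, 2015, 2018, 2019)

# Proportion of tracks released from 2001 onwards
share_21st <- mean(year >= 2001)

# Tracks per year, with 2020 excluded as in the plot above
tracks_per_year <- table(year[year < 2020])
```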
## Find popular genres over the decades
spotify_1 <- spotify_songs %>%
select(track_popularity,year,playlist_genre) %>%
mutate(year = as.numeric(spotify_songs$year), decade = year - (year %% 10))
spotify_2 <- spotify_1 %>%
filter(track_popularity > 50) %>%
group_by(decade, playlist_genre) %>%
summarise(count = n())
decadewise_tracks_genre <- spotify_2 %>%
group_by(decade) %>%
ggplot(aes(fill = playlist_genre, x = decade, y = count)) +
geom_bar(position= "stack", stat = "identity") +
labs(title = "Popular genre over the decades", x = "Decade", y = "No of tracks with popularity > 50")
ggplotly(decadewise_tracks_genre)
Rock music was quite popular in the 1960s and 1970s, whereas pop songs are the most popular in the 2010s. This shows a drastic change in people’s choice of songs between the 1960s-70s and the 2010s.
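Because the number of tracks differs a lot between decades, it can help to look at each genre's share within a decade alongside the raw counts. This is a sketch with made-up rows, not the analysis's own computation.

```r
# Made-up popular tracks with their decade and genre
toy <- data.frame(
  decade = c(1970, 1970, 1970, 2010, 2010, 2010, 2010),
  genre  = c("rock", "rock", "pop", "pop", "pop", "rap", "rock")
)

# Counts per decade and genre (what the stacked bar chart shows)
counts <- table(toy$decade, toy$genre)

# Share of each genre within its decade (each row sums to 1)
shares <- prop.table(counts, margin = 1)
```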
##### Correlation plot for numeric columns
corr_spotify <- spotify_songs %>%
select(track_popularity, danceability, energy, loudness, speechiness,
acousticness, instrumentalness, liveness, valence, tempo, duration_ms)
corrplot(cor(corr_spotify),type="lower")
We can observe from the correlation matrix that Loudness & Energy have a moderate-to-strong positive correlation. Similarly, Acousticness & Energy, and Acousticness & Loudness have a negative correlation. All other feature pairs show little linear correlation.
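To read the strongest pairs off a correlation matrix without relying on the plot, the matrix can be filtered by absolute value. The sketch below uses simulated data built to mimic the pattern described above; the simulated columns and the 0.5 cut-off are assumptions chosen for illustration.

```r
set.seed(1)
n <- 200

# Simulated features: loudness rises and acousticness falls with energy
energy       <- runif(n)
loudness     <- -20 + 15 * energy + rnorm(n, sd = 1)
acousticness <- 1 - energy + rnorm(n, sd = 0.1)
tempo        <- rnorm(n, mean = 120, sd = 20)  # generated independently
toy <- data.frame(energy, loudness, acousticness, tempo)

corr <- cor(toy)

# Keep each pair once (upper triangle) and list those with |r| >= 0.5
idx <- which(upper.tri(corr) & abs(corr) >= 0.5, arr.ind = TRUE)
strong <- data.frame(
  feature_1 = rownames(corr)[idx[, "row"]],
  feature_2 = colnames(corr)[idx[, "col"]],
  r         = corr[idx]
)
```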
How do the genres correlate with each other? We will calculate the median feature values of each genre and then compute correlation between them to find out.
# median features by genre
avg_feature_genre <- spotify_songs %>%
group_by(playlist_genre) %>%
summarise_if(is.numeric, median, na.rm = TRUE) %>%
ungroup()
avg_genre_cor <- avg_feature_genre %>%
select(track_popularity, danceability, energy, loudness, speechiness,
acousticness, instrumentalness, liveness, valence, tempo, duration_ms) %>%
scale() %>%
t() %>%
as.matrix() %>%
cor()
colnames(avg_genre_cor) <- avg_feature_genre$playlist_genre
row.names(avg_genre_cor) <- avg_feature_genre$playlist_genre
avg_genre_cor %>% corrplot::corrplot(method = 'color',
order = 'hclust',
type = 'upper',
tl.col = 'black',
diag = FALSE,
addCoef.col = "grey40",
number.cex = 0.75,
col = colorRampPalette(colors = c(
'red',
'white',
'darkblue'))(200),
mar = c(2,2,2,2),
main = 'Correlation Between Median Genre
Feature Values',
family = 'Avenir')
We observe that EDM and Rock are negatively correlated with all other genres except each other. Latin and R&B are the most similar to each other, with a correlation of 0.37, while EDM and R&B are the most different, with a correlation of -0.69.
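The `scale() %>% t() %>% cor()` chain used above can be followed step by step on a small example. The median values below are made up and only three genres and four features are included.

```r
# Made-up median feature values for three genres
medians <- data.frame(
  genre        = c("edm", "rock", "r&b"),
  energy       = c(0.83, 0.78, 0.60),
  danceability = c(0.66, 0.52, 0.69),
  acousticness = c(0.02, 0.05, 0.20),
  valence      = c(0.38, 0.53, 0.55)
)

features <- as.matrix(medians[, -1])
rownames(features) <- medians$genre

# 1. Standardise each feature across genres so no feature dominates by scale
# 2. Transpose so that genres become columns
# 3. cor() then correlates the genres' standardised feature profiles
genre_cor <- cor(t(scale(features)))
```

Each entry of `genre_cor` compares two genres across all features at once, which is why the result is a genre-by-genre matrix.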
# measuring variation trend of popularity with song/track features
song_features <- c('danceability', 'energy', 'loudness', 'speechiness',
'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms')
popularity_features<- spotify_songs %>%
select(c('track_popularity', all_of(song_features))) %>%
group_by(track_popularity) %>%
summarise_if(is.numeric, mean) %>%
ungroup()
plot_list = list()
for (i in 1:length(song_features)) {
plot_list[[i]] = ggplot(popularity_features, aes_string(x = "track_popularity",
y = song_features[i])) +
geom_point(shape = 20, color = 3) +
geom_smooth(method = lm, linetype = "dashed", color = "blue", se = F) +
xlab("Track Popularity")
}
suppressMessages(do.call(grid.arrange,
c(plot_list, list(top = "variation trend of popularity with Track features"))))
Danceability seems to increase with popularity, and instrumentalness is very low for popular songs. Popular songs also tend to be shorter in duration.
#correlation between valence and song popularity
trackpopularity_vs_valence <- ggplot(spotify_songs, aes(valence, track_popularity)) +
geom_jitter(color = "orange", alpha = 0.5) + theme_light()+
geom_smooth(color = 'black')
trackpopularity_vs_valence
The song popularity remains almost the same for high and low valence. This suggests that happy or cheerful songs and sad or depressed songs are equally popular among users.
Plotting the Energy Levels for Each Genre
boxplot(energy~playlist_genre, data = spotify_songs,
main = "Energy Levels per Genre",
xlab = "Energy Level",
ylab = "Genre",
col = "Yellow",
border = "black",
horizontal = TRUE
)
R&B songs generally have much lower energy than songs of any other genre, while Rock and EDM songs generally have the highest energy levels.
boxplot(liveness~playlist_genre, data = spotify_songs,
main = "Liveness per Genre",
xlab = "Liveness",
ylab = "Genre",
col = "Green",
border = "black",
horizontal = TRUE
)
boxplot(valence~playlist_genre, data = spotify_songs,
main = "Valence per Genre",
xlab = "Valence",
ylab = "Genre",
col = "Pink",
border = "black",
horizontal = TRUE
)
Plotting histograms of all the numeric features of the songs
plot_num(spotify_songs)
Instrumentalness is at the lowest level for a majority of the songs. Acousticness and speechiness are also skewed towards the lower levels.
Loudness & Acousticness vs Energy Scatter Plot
s1 <- spotify_songs %>% ggplot(aes(energy,loudness)) +
geom_point(color = 'green', alpha = 0.1, shape = 1) +
geom_smooth(color = 'black')
s2 <- spotify_songs %>% ggplot(aes(energy,acousticness)) +
geom_point(color = 'red', alpha = 0.1, shape = 1) +
geom_smooth(color = 'black')
s3 <- spotify_songs %>% ggplot(aes(energy,instrumentalness)) +
geom_point(color = 'blue', alpha = 0.1, shape = 1) +
geom_smooth(color = 'black')
s4 <- spotify_songs %>% ggplot(aes(energy,liveness)) +
geom_point(color = 'yellow', alpha = 0.1, shape = 1) +
geom_smooth(color = 'black')
#Plotting Variations of Loudness, Acousticness, Instrumentalness, and Liveness with Energy
ggarrange(s1,s2,s3,s4)
Comparing the other features with energy, we can see that acousticness decreases as energy increases, while loudness is higher for highly energetic songs.
trend_chart <- function(arg){
# average of the chosen feature per release year, for tracks released after 2010
trend_change <- spotify_songs %>% filter(year>2010) %>% group_by(year) %>% summarise(Average = mean(.data[[arg]]))
chart <- ggplot(data = trend_change, aes(x = year, y = Average)) +
geom_line(color = "black", size = 1) +
scale_x_continuous(breaks=seq(2011, 2020, 1)) + scale_y_continuous(name = arg)
return(chart)
}
trend_chart_track_popularity<-trend_chart("track_popularity")
trend_chart_danceability<-trend_chart("danceability")
trend_chart_energy<-trend_chart("energy")
trend_chart_loudness<-trend_chart("loudness")
trend_chart_duration_ms<-trend_chart("duration_ms")
trend_chart_speechiness<-trend_chart("speechiness")
plot_grid(trend_chart_track_popularity, trend_chart_danceability, trend_chart_energy, trend_chart_loudness, trend_chart_duration_ms, trend_chart_speechiness,ncol = 2, label_size = 1)
A clear trend can be seen in the duration: over the years, the duration of songs has decreased rapidly. The danceability of the songs, by contrast, has been increasing over the years.
‘Love’ is the most frequently used word in song titles, closely followed by “Don’t” and “Like”.
Popular songs are just 2.5 - 4 minutes long.
Over the years, the duration of the songs has also decreased drastically.
In the 1960s and 1970s, the ‘Rock’ genre was more popular than the other genres. Over the years, the ‘Pop’ genre has overtaken it in popularity.
Instrumentalness is lower than 0.1 in a majority of observations.
Valence, Energy and Danceability seem to be approximately normally distributed.
Songs with higher speechiness are generally not popular.
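# free the memory held by the dense term-document matrix
rm(dtm_matrix)
# keep only the numeric columns (audio features, popularity, year) plus key
spotify_reduced <- spotify_songs %>% select(-c(track_artist, track_name, track_album_release_date,
playlist_genre, mode))
# key is a factor; as.numeric() returns its level codes (1-12 for keys 0-11),
# which is harmless here because every column is standardised next
spotify_reduced$key <- as.numeric(spotify_reduced$key)
# standardise every column to mean 0 and standard deviation 1
spotify_reduced <- as.data.frame(scale(spotify_reduced))
Taking 5670 as the optimal number of clusters (about five songs per cluster on average) and creating the clusters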
clust <- kmeans(spotify_reduced, 5670)
# final data -
spotify_songs_final <- cbind(spotify_songs, cluster_num = clust$cluster)
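How the value of 5670 was selected is not shown here. One common heuristic for choosing the number of clusters is the elbow method, which plots the total within-cluster SSE against k. The sketch below runs it on R's built-in `iris` measurements purely for illustration; the candidate range 1 to 8 and the `nstart` value are assumptions.

```r
set.seed(42)  # k-means starts from random centres, so fix the seed

# Built-in example data, standardised in the same way as spotify_reduced
x <- scale(iris[, 1:4])

# Total within-cluster SSE for each candidate number of clusters
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(x, centers = k, nstart = 10)$tot.withinss
})

# Look for the k after which the curve flattens
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SSE")
```

Setting a seed before calling `kmeans()` also makes the cluster numbers, and any goodness-of-fit figure derived from them, reproducible between runs.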
Checking goodness of fit for clusters
We know that Total SSE = Within-Cluster SSE + Between-Cluster SSE. If most of the total SSE is captured between clusters and the within-cluster SSE is minimized, our clustering is a good fit. We therefore check the % of total SSE captured by the between-cluster SSE.
print(paste0(round(clust$betweenss/clust$totss, 4)*100, "%"))
## [1] "90.99%"
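The decomposition behind this ratio can be verified by hand. The sketch below recomputes the three sums of squares for a k-means fit on the built-in `iris` data (chosen only as an example, with an assumed k of 3) and compares them with the values that `kmeans()` reports.

```r
set.seed(42)
x   <- scale(iris[, 1:4])
fit <- kmeans(x, centers = 3, nstart = 10)

# Total SSE: squared distances of all points from the overall mean
total_ss <- sum(scale(x, scale = FALSE)^2)

# Within-cluster SSE: squared distances of points from their own cluster centre
within_ss <- sum((x - fit$centers[fit$cluster, ])^2)

# Between-cluster SSE is what remains
between_ss <- total_ss - within_ss

# Same quantity as fit$betweenss / fit$totss
ratio <- between_ss / total_ss
```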
Filtering out the songs which couldn’t be clustered or have few neighbors
t <- spotify_songs_final %>% group_by(cluster_num) %>% summarise(n = n())
spotify_songs_final_songs <- spotify_songs_final %>% group_by(cluster_num) %>% filter(n()>3)
Song Suggestion
Sample Case 1: Imagine you are listening to “I Don’t Care” by Justin Bieber and Ed Sheeran. It is pretty popular and a great song for casual listening. Let’s see which song suggestions we receive for this particular song.
We see that there are two tracks of the same name; the only difference is their popularity. They are indeed clustered together by our K-means algorithm.
Based on the song’s audio features like Acousticness, Danceability, Speechiness, Instrumentalness, etc., our algorithm will suggest songs which are musically closer to “I Don’t Care” by Ed Sheeran. Let’s look at the suggested songs -
Let’s look at a couple more examples.
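A cluster-based lookup of this kind can be sketched in a few lines. The helper `suggest_songs()` below is hypothetical and the toy data is made up; ordering the suggestions by popularity is also an assumption rather than something stated in the analysis.

```r
# Toy version of spotify_songs_final (names and cluster numbers are made up)
songs <- data.frame(
  track_name       = c("Song A", "Song B", "Song C", "Song D", "Song E"),
  track_artist     = c("Artist 1", "Artist 2", "Artist 3", "Artist 4", "Artist 5"),
  track_popularity = c(80, 65, 90, 40, 70),
  cluster_num      = c(1, 1, 2, 1, 2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: other songs in the same cluster(s) as the chosen track,
# most popular first
suggest_songs <- function(data, name) {
  clusters <- unique(data$cluster_num[data$track_name == name])
  hits <- data[data$cluster_num %in% clusters & data$track_name != name, ]
  hits[order(-hits$track_popularity),
       c("track_name", "track_artist", "track_popularity")]
}

suggestions <- suggest_songs(songs, "Song A")
```

A track name that is not in the data simply returns an empty data frame.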
And Corresponding Suggestions :
And Corresponding Suggestions :
And Corresponding Suggestions :
Summary
Good clusters have high similarity within each cluster, i.e. a low Within-Cluster SS, and maximum dissimilarity between clusters, i.e. a high Between SS. In summary, we can measure the Between SS / Total SS ratio; a value close to 1 (100%) means the clustering fits the data well. This holds in our case, as the ratio is about 91%, which indicates that the clustering is pretty accurate.
Limitations