1. Problem Statement
If music be the food of love, play on…
Over the centuries since Shakespeare, music has remained an indispensable part of our daily life. Hardly a day passes without humming a tune or listening to one. The ways we listen have also evolved over time. From vintage gramophones through mix tapes and radio, our listening medium has now reached online music-streaming platforms. Spotify, Apple Music and Google Play are some of the most popular of these.
With over 36% market share among online music subscribers and a base of over 100 million subscribers, Spotify occupies the top spot. As a music aficionado, I was prompted to dig deeper into the Spotify songs database to discover interesting trends in the songs and their artists, and hopefully to help people discover new music.
2. Data
The Spotify_data used for this analysis is a contiguous dataset of over 30,000 songs with 12 audio features for each track, including confidence measures like acousticness, liveness, speechiness and instrumentalness, perceptual measures like energy, loudness, danceability and valence (positiveness), and descriptors like duration, key, tempo and mode.
3. The approach
A three-pronged approach is taken here:
An overview of the genres and the artists
An analysis of the track popularity and how it extends across genres
Based on the preferences of the user, build a simple system that gives out related songs based on genre and / or artist
Several univariate and multivariate analyses are done across variables post data preparation to substantiate the findings / results obtained by the above approaches
4. What can you do with the analysis
Identify the association between different types of songs
As a music aficionado, find the songs that were off your radar
If you love that genre, get a recommendation for it
Identify the popular songs, the popularity of artists etc
Since a song can be associated with multiple genres on Spotify, the same song may appear more than once in the analysis.
The packages used in the project (currently) are:
data.table: To read the csv files in the fastest possible way
DT : Filtering, pagination, and sorting of data tables in html outputs
kableExtra : Manipulate table styles for good visualizations
knitr : Aligned displays of table in a html doc
stringr: String replacements and pattern matching
formattable: Creates aesthetic tables in R
plotly: Interactive graphing library
wordcloud: Creates wordclouds
RColorBrewer: Offers several color palettes for R
tidyselect: Lets you create selecting verbs that are consistent with other tidyverse packages
tm: Used for text mining - creating Corpus
shiny: to create the shiny app
shinythemes: to alter shiny themes
tidyverse : Collection of R packages for data manipulation, exploration and visualization. I am currently using
# Loading the packages
library(data.table)
library(DT)
library(kableExtra)
library(knitr)
library(stringr)
library(formattable)
library(plotly)
library(wordcloud)
library(RColorBrewer)
library(tidyverse)
library(tidyselect)
library(tm)
library(shiny)
library(shinythemes)
The data for the analysis comes from Spotify via the spotifyr package. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get either your own data or general metadata around songs from Spotify’s API. This data was gathered in January 2020.
spotify <- read.csv('D:/Cinci prep/Coursework/Data Wrangling with R/spotify_songs.csv')
The attribute of the dataset and the column information is provided for the key variables.
#dimension of the dataset
dimension <- str_trim(paste0(dim(spotify), sep = " ", collapse = ""))
#column names of the dataset
vars <- str_trim(paste0(names(spotify), sep = " | ", collapse = ""))
# Creating a table
table_attributes <- tibble(Data = 'Spotify_songs',
`Rows Columns` = dimension,
Variables = vars)
# Printing the table
kable(table_attributes, format = "html") %>%
kable_styling(bootstrap_options = "striped") %>%
column_spec(2, width = "12em")
Data | Rows Columns | Variables |
---|---|---|
Spotify_songs | 32833 23 | track_id | track_name | track_artist | track_popularity | track_album_id | track_album_name | track_album_release_date | playlist_name | playlist_id | playlist_genre | playlist_subgenre | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | |
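The "Rows Columns" string above is built with paste0() and a sep argument. paste0() has no sep parameter, so the value is simply pasted on as extra text and then trimmed. A minimal sketch on a made-up data frame (toy is a hypothetical stand-in, not the Spotify data) shows that paste() with collapse gives the same string more directly:

```r
# Small stand-in data frame (made up) with 3 rows and 2 columns
toy <- data.frame(a = 1:3, b = 4:6)

# paste0() has no `sep` argument, so sep = " " is pasted on as extra text
via_paste0 <- trimws(paste0(dim(toy), sep = " ", collapse = ""))

# paste() with collapse builds the same string directly
via_paste <- paste(dim(toy), collapse = " ")

print(via_paste0)
print(via_paste)
```

Both approaches agree here; the paste() form is arguably easier to read, and it avoids the trailing separator that the paste0() form leaves behind before trimming.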
We have 13 numeric variables (4 - integer, 9 - float) and 10 character variables in this dataset.
The next step in cleaning the data was to identify any missing values across columns. A quick is.na() check is used for the purpose.
# Identifying the missing values across columns
miss <- colSums(is.na(spotify))
print(miss[miss>0])
## track_name track_artist track_album_name
## 5 5 5
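To make the check concrete, here is a small sketch on a made-up data frame (toy and its values are hypothetical). It uses the same colSums(is.na()) idea and shows what na.omit() does to the incomplete rows:

```r
# Toy data frame standing in for the Spotify data (values are made up)
toy <- data.frame(
  track_name = c("Song A", NA, "Song C"),
  track_artist = c("Artist 1", NA, "Artist 3"),
  track_popularity = c(55L, 60L, 70L),
  stringsAsFactors = FALSE
)

# Count the missing values in each column, keeping only affected columns
miss <- colSums(is.na(toy))
miss_only <- miss[miss > 0]
print(miss_only)

# Drop every row that has at least one missing value
toy_complete <- na.omit(toy)
print(nrow(toy_complete))
```

Note that na.omit() drops a row if any column is missing, so it is worth checking the counts first, as done here.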
Since removing these observations is not expected to affect our approach, we drop the incomplete observations.
# Removing observations with missing values
spotify <- na.omit(spotify)
# Removing irrelevant columns
spotify <- spotify %>% select(-c(playlist_id,playlist_name,track_album_id,playlist_subgenre))
The columns playlist_id, playlist_name, playlist_subgenre and track_album_id will not be used in the ensuing analysis, so we remove these columns as well.
# Identifying duplicate values
print(paste("Number of duplicate observations : " , sum(duplicated(spotify))))
## [1] "Number of duplicate observations : 2448"
spotify <- spotify[!duplicated(spotify),]
print(paste("We have ",sum(complete.cases(spotify)),"Complete cases!"))
## [1] "We have 30380 Complete cases!"
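One plausible reason duplicates show up at this point is that rows which differed only in the dropped playlist columns become identical once those columns are gone. This is an assumption about the data, illustrated with a made-up toy data frame rather than the real file:

```r
# Made-up rows: the same track listed under two playlist sub-genres
toy <- data.frame(
  track_id = c("t1", "t1", "t2"),
  track_name = c("Song A", "Song A", "Song B"),
  playlist_genre = c("pop", "pop", "rock"),
  playlist_subgenre = c("dance pop", "electropop", "classic rock"),
  stringsAsFactors = FALSE
)

# With the sub-genre column present, no two rows are exactly equal
n_dup_before <- sum(duplicated(toy))

# After dropping the column, the two "Song A" rows become exact duplicates
toy$playlist_subgenre <- NULL
n_dup_after <- sum(duplicated(toy))

# Keep the first occurrence of each row
toy_unique <- toy[!duplicated(toy), ]
print(c(before = n_dup_before, after = n_dup_after))
```

A track that sits in two different genres would still appear twice after this step, because playlist_genre is kept.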
#Converting df from wide to long for plotting the box plot of variable distributions
long_df <- gather(spotify,decriptive_vars, values, danceability, energy,
speechiness, acousticness, liveness,valence,
factor_key=TRUE)
#box plot of the audio features
var_distribution <- ggplot(long_df, aes(x = decriptive_vars, y = values)) +
geom_boxplot() +
coord_flip() +
labs(title = "Distribution across rest of the variables", x = "Audio features", y = "values")
key_box_plot <- ggplot(spotify, aes(y = key)) + geom_boxplot() + guides(fill = "none") + labs(title = 'Distribution of key')
loudness_box_plot <- ggplot(spotify, aes(y = loudness)) + geom_boxplot() + guides(fill = "none") + labs(title = 'Distribution of loudness')
instrumentalness_box_plot <- ggplot(spotify, aes(y = instrumentalness)) + geom_boxplot() + guides(fill = "none") + labs(title = 'Distribution of instrumentalness')
tempo_box_plot <- ggplot(spotify, aes(y = tempo)) + geom_boxplot() + guides(fill = "none") + labs(title = 'Distribution of tempo')
duration_box_plot <- ggplot(spotify, aes(y = duration_ms)) + geom_boxplot() + guides(fill = "none") + labs(title = 'Distribution of duration_ms')
ggplotly(key_box_plot)
ggplotly(loudness_box_plot)
ggplotly(instrumentalness_box_plot)
ggplotly(tempo_box_plot)
ggplotly(duration_box_plot)
ggplotly(var_distribution)
Some observations show up as outliers in more than one variable, with values equal to 0 across multiple audio features. We remove those observations as well.
#removing outliers from the dataset
spotify <- spotify[!(spotify$duration_ms==4000),]
duration_outliers <- boxplot(spotify$duration_ms,
plot = FALSE, range = 4)$out
spotify <- spotify %>%
filter(!duration_ms %in% duration_outliers)
duration_new_plot <- spotify %>%
ggplot(aes(y = duration_ms)) +
geom_boxplot() +
coord_flip() +
labs(title = 'Duration, outliers removed')
duration_new_plot
We also remove outliers in the duration_ms column, defined here as values lying more than 4 times the interquartile range beyond the box of the box plot (range = 4 in boxplot()).
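The outlier rule can be tried on a handful of made-up durations (the numbers below are illustrative, not taken from the dataset). With range = 4, boxplot() flags any point that lies more than 4 times the interquartile range beyond the hinges:

```r
# Made-up track durations in milliseconds; the last one is far too long
duration_ms <- c(180000, 200000, 210000, 220000, 240000, 1200000)

# range = 4 flags points more than 4 x IQR beyond the box;
# plot = FALSE returns the statistics without drawing anything
outliers <- boxplot(duration_ms, plot = FALSE, range = 4)$out

# Keep only the durations that were not flagged
kept <- duration_ms[!duration_ms %in% outliers]
print(outliers)
print(length(kept))
```

The default range of 1.5 would flag more points; 4 is a deliberately lenient choice that removes only extreme values.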
We are creating 2 new columns
year - which will have the year of release of the album / song
durn_minutes - which will have the duration of the song in minutes
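As a quick sketch of the two derived columns, assume the release date is stored as text that begins with a four-digit year (the toy rows below are made up, including one year-only date):

```r
# Made-up rows; dates are assumed to start with a four-digit year
toy <- data.frame(
  track_album_release_date = c("2019-06-14", "2012"),
  duration_ms = c(210000, 240000),
  stringsAsFactors = FALSE
)

# The first four characters of the date give the release year
toy$year <- substr(as.character(toy$track_album_release_date), 1, 4)

# Milliseconds to minutes: divide by 1000 ms per second and 60 s per minute
toy$durn_minutes <- toy$duration_ms / (1000 * 60)
print(toy)
```

Taking the first four characters keeps the step robust to dates of different precision, as long as the year always comes first.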
#creating a year column (the first four characters of the release date)
spotify$track_album_release_date <- as.character(spotify$track_album_release_date)
spotify$year <- substr(spotify$track_album_release_date,1,4)
#Creating a duration in minutes column
spotify$durn_minutes <- spotify$duration_ms/(1000*60)
#dimension of the dataset
dimension <- str_trim(paste0(dim(spotify), sep = " ", collapse = ""))
#column names of the dataset
vars <- str_trim(paste0(names(spotify), sep = " | ", collapse = ""))
# Creating a table
table_attributes <- tibble(Data = 'Spotify_songs',
`Rows Columns` = dimension,
Variables = vars)
# Printing the table
kable(table_attributes, format = "html") %>%
kable_styling(bootstrap_options = "striped") %>%
column_spec(2, width = "12em")
Data | Rows Columns | Variables |
---|---|---|
Spotify_songs | 30379 21 | track_id | track_name | track_artist | track_popularity | track_album_name | track_album_release_date | playlist_genre | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | year | durn_minutes | |
datatable(head(spotify, 100))
The variables are described below:
Variable | Data Type | Variable description |
---|---|---|
track_id | factor | Song unique ID |
track_name | factor | Song Name |
track_artist | factor | Song Artist |
track_popularity | integer | Song Popularity (0-100) where higher is better |
track_album_name | factor | Song album name |
track_album_release_date | character | Date when album released |
playlist_genre | factor | playlist genre |
danceability | numeric | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1 is most danceable. |
energy | numeric | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
key | integer | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 2 = D, and so on. If no key was detected, the value is -1. |
loudness | numeric | The overall loudness of a track in decibels (dB). Loudness values are averaged across the full track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB. |
mode | integer | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0 |
speechiness | numeric | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks |
acousticness | numeric | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence that the track is acoustic. |
instrumentalness | numeric | Predicts whether a track contains no vocals. OOH and AAH sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly vocal. The closer the instrumentalness is to 1, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
liveness | numeric | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
valence | numeric | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
tempo | numeric | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
duration_ms | integer | Duration of song in milliseconds |
year | character | Year of release of the album |
durn_minutes | numeric | duration of song in minutes |
First, let us find out the distribution of songs across genres. Which genre has the most songs in the dataset?
# songs per genre
spotify %>% group_by(Genre = playlist_genre) %>%
summarise(No_of_tracks = n()) %>% knitr::kable()
Genre | No_of_tracks |
---|---|
edm | 5537 |
latin | 4639 |
pop | 5132 |
r&b | 5138 |
rap | 5483 |
rock | 4450 |
EDM is the genre with the most songs (5537), followed by rap (5483) and then R&B (5138), which is just ahead of pop (5132).
# artists with most releases
most_releases <- spotify %>% group_by(Artist = track_artist) %>%
summarise(No_of_tracks = n()) %>%
arrange(desc(No_of_tracks)) %>%
top_n(15, wt = No_of_tracks) %>%
ggplot(aes(x = Artist, y = No_of_tracks)) +
geom_bar(stat = "identity") +
coord_flip() + labs(title = "artists with the most releases", x = "artists", y = "no of releases")
ggplotly(most_releases)
With ~130 tracks to their name, Queen have been the busiest artists over time. David Guetta comes in second with ~90 tracks.
#Create a vector containing only the text
text <- spotify$track_name
# Create a corpus
docs <- Corpus(VectorSource(text))
#clean text data
docs <- docs %>%
tm_map(removeNumbers) %>%
tm_map(removePunctuation) %>%
tm_map(stripWhitespace)
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removeWords,c("feat","edit","remix","remastered","remaster","radio","version","original","mix"))
#create a term-document matrix
dtm <- TermDocumentMatrix(docs)
matrix <- as.matrix(dtm)
words <- sort(rowSums(matrix),decreasing=TRUE)
df <- data.frame(word = names(words),freq=words)
#generate the word cloud
wordcloud(words = df$word, freq = df$freq,scale=c(8,0.25), min.freq = 1,
max.words=150, random.order=FALSE, rot.per=0.25,
colors=brewer.pal(8, "Dark2"))
No word has been more associated with music than love, and love is indeed the most frequently used word in track titles. Like, don’t and one are among the other frequent ones.
Is there such a thing as a golden age of music? A span of years when a lot of songs were released? The graph below will tell us.
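The counting behind the word cloud can be sketched in base R on a few made-up titles. This is a simplified stand-in for the tm pipeline above, and the short list of dropped words is a hypothetical substitute for stopwords("english"):

```r
# Made-up track titles for illustration
titles <- c("Love Me Like You Do", "Love Story - Remix", "Somebody to Love")

# Lower-case the titles and replace anything that is not a letter or space
clean <- tolower(gsub("[^A-Za-z ]", " ", titles))

# Split into words and discard empty strings
tokens <- unlist(strsplit(clean, "\\s+"))
tokens <- tokens[nzchar(tokens)]

# Hypothetical, hand-picked list of words to ignore
drop_words <- c("me", "you", "do", "to", "remix")
tokens <- tokens[!tokens %in% drop_words]

# Word frequencies, most frequent first
freq <- sort(table(tokens), decreasing = TRUE)
top_word <- names(freq)[1]
print(freq)
```

The resulting frequencies are what wordcloud() scales the word sizes by.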
# grouping tracks by years
plot_year <- spotify %>%
select(year) %>%
group_by(year) %>%
summarise(count = n())
#plotting releases across years
year_plot <- ggplot(plot_year,aes(x = year, y = count,group = 1)) +
geom_line() +
theme(legend.position = "none",axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title = "Release of songs across years", x = "Year",
y = "No of songs released")
ggplotly(year_plot)
On further inspection, over 80% of the songs have been released in the 21st century. Let us get a clearer picture of this period, shall we?
# zooming into 21st century
plot_zoom_year <- spotify %>%
select(year) %>%
group_by(year) %>%
summarise(count = n()) %>%
subset(year >= 2000)
graph_zoom_year <- ggplot(plot_zoom_year,aes(x = year, y = count,group = 1)) +
geom_line() +
theme(legend.position = "none",axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title = "Music in 21st Century", x = "Year",
y = "No of songs released")
ggplotly(graph_zoom_year)
The advent and popularity of the internet may have led to this remarkable spike in the number of songs released after 2000. The dip in 2020 is because we have less than a month of data for 2020.
Now let us really start exploring the genres and understand the features that characterise each genre.
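One detail in the filter above is worth a note: year is a character column, so year >= 2000 is evaluated as a string comparison after R converts the number to text. That works for four-digit years, but converting to integer first is a safer habit. A minimal sketch with made-up values:

```r
# Made-up years stored as text, as substr() would produce
years <- c("1999", "2005", "2019")

# Numeric comparison after an explicit conversion
recent <- as.integer(years) >= 2000L
print(recent)

# A string comparison can surprise: "999" sorts after "2000" as text
string_compare <- "999" >= 2000
print(string_compare)
```

In the pipeline this would amount to something like filter(as.integer(year) >= 2000), which is a suggestion rather than what the original code does.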
#DF of characteristics of genres
genre_description <- spotify %>%
group_by(Genre = playlist_genre) %>%
summarise(Danceability = median(danceability),
Energy = median(energy),
Key = median(key),
Loudness = median(loudness),
Mode = median(mode),
Speechiness = median(speechiness),
Acousticness = median(acousticness),
Instrumentalness = median(instrumentalness),
Liveness = median(liveness),
Valence = median(valence),
Tempo = median(tempo),
Duration = median(durn_minutes))
kable(genre_description , format = "html") %>%
kable_styling(bootstrap_options = "striped") %>%
column_spec(2, width = "12em")
Genre | Danceability | Energy | Key | Loudness | Mode | Speechiness | Acousticness | Instrumentalness | Liveness | Valence | Tempo | Duration |
---|---|---|---|---|---|---|---|---|---|---|---|---|
edm | 0.661 | 0.8270 | 6 | -5.0270 | 1 | 0.06060 | 0.02040 | 3.75e-03 | 0.138 | 0.376 | 127.0040 | 3.406967 |
latin | 0.726 | 0.7270 | 6 | -5.8620 | 1 | 0.06620 | 0.13800 | 3.00e-06 | 0.121 | 0.620 | 112.0830 | 3.513850 |
pop | 0.650 | 0.7270 | 5 | -5.8740 | 1 | 0.04890 | 0.07595 | 1.38e-05 | 0.123 | 0.499 | 120.0255 | 3.517558 |
r&b | 0.688 | 0.5990 | 6 | -7.4125 | 1 | 0.06785 | 0.16400 | 4.80e-06 | 0.120 | 0.542 | 108.8660 | 3.870667 |
rap | 0.734 | 0.6650 | 6 | -6.5180 | 1 | 0.17500 | 0.11100 | 0.00e+00 | 0.128 | 0.516 | 119.9990 | 3.500000 |
rock | 0.523 | 0.7785 | 5 | -6.9510 | 1 | 0.04200 | 0.03680 | 2.05e-04 | 0.137 | 0.530 | 124.0135 | 3.947783 |
As expected, EDM is the “high energy genre”. Rap is also a popular party genre with 0.73 danceability (top rank). EDM, rock and pop have the highest tempo
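The per-genre medians can also be computed in base R. The sketch below uses a made-up toy data frame with two genres and two features; aggregate() plays the role of group_by() plus summarise():

```r
# Made-up feature values for two genres
toy <- data.frame(
  playlist_genre = c("edm", "edm", "edm", "rap", "rap", "rap"),
  energy = c(0.90, 0.80, 0.85, 0.60, 0.70, 0.65),
  danceability = c(0.60, 0.70, 0.65, 0.80, 0.70, 0.75),
  stringsAsFactors = FALSE
)

# Median of each feature within each genre
genre_medians <- aggregate(cbind(energy, danceability) ~ playlist_genre,
                           data = toy, FUN = median)

# Genre with the highest median for each feature
top_energy <- as.character(genre_medians$playlist_genre[which.max(genre_medians$energy)])
top_dance <- as.character(genre_medians$playlist_genre[which.max(genre_medians$danceability)])
print(genre_medians)
```

Medians are used rather than means because several features, such as instrumentalness, are heavily skewed.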
Now that we have had an overview of the genres and related aspects in the previous section, let us find out if there is anything discernible about the popularity of tracks
#finding popular artists
popular_artists <- spotify %>% group_by(Artist = track_artist) %>%
summarise(No_of_tracks = n(),Popularity = mean(track_popularity)) %>%
filter(No_of_tracks > 2) %>%
arrange(desc(Popularity)) %>%
top_n(15, wt = Popularity) %>%
ggplot(aes(x = Artist, y = Popularity)) +
geom_bar(stat = "identity") +
coord_flip() + labs(title = "popular artists overall", x = "Artists", y = "Popularity")
ggplotly(popular_artists)
Trevor Daniel, Y2K and Don Toliver are the most popular artists. I have required artists to have more than 2 tracks to their name (the filter No_of_tracks > 2) so as to eliminate “one hit wonders”.
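The effect of the track-count filter is easy to see on made-up numbers. In the hypothetical toy data below, artist B has the single most popular track but only one track, so B drops out before the ranking:

```r
# Made-up popularity scores; artist B has only one track
toy <- data.frame(
  track_artist = c("A", "A", "A", "B", "C", "C", "C"),
  track_popularity = c(80, 70, 90, 99, 60, 65, 70),
  stringsAsFactors = FALSE
)

# Popularity scores grouped by artist
by_artist <- split(toy$track_popularity, toy$track_artist)
n_tracks <- lengths(by_artist)
mean_pop <- sapply(by_artist, mean)

# Keep artists with more than 2 tracks, then rank by mean popularity
eligible <- mean_pop[n_tracks > 2]
ranked <- sort(eligible, decreasing = TRUE)
print(ranked)
```

The threshold is a judgment call: a higher cut-off favours established artists, while a lower one lets single hits dominate the ranking.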
# top artists in each genre
top_artists_genre <- spotify %>%
group_by(Genre = playlist_genre, Artist = track_artist) %>%
summarise(No_of_tracks = n(), Popularity = mean(track_popularity)) %>%
filter(No_of_tracks > 2) %>%
arrange(desc(Popularity)) %>%
top_n(1, wt = Popularity)
kable(top_artists_genre , format = "html") %>%
kable_styling(bootstrap_options = "striped") %>%
column_spec(2, width = "12em")
Genre | Artist | No_of_tracks | Popularity |
---|---|---|---|
latin | Billie Eilish | 3 | 91.66667 |
pop | Travis Scott | 3 | 89.00000 |
r&b | XXXTENTACION | 3 | 86.33333 |
edm | MEDUZA | 3 | 85.33333 |
rap | JACKBOYS | 3 | 84.33333 |
rock | MGMT | 3 | 74.66667 |
#popularity among genres
rating_plot <- ggplot(spotify, aes(x = playlist_genre, y = track_popularity)) +
geom_boxplot() +
coord_flip() +
labs(title = "Popularity across genres", x = "Genres", y = "Popularity")
ggplotly(rating_plot)
Pop has the highest median popularity among the genres. EDM has the lowest median popularity.
For better clarity, I am only considering the years 2000 and later.
#popularity movement across years
pop_across_years <- spotify %>%
group_by(playlist_genre, year) %>%
summarise(avg = mean(track_popularity) )%>%
subset(year >=2000)
year_graph <- pop_across_years %>%
ggplot(aes(x = year, y = avg,
group = playlist_genre, color = playlist_genre)) +
geom_line() + labs(title = "21st Century", x = "Year of release",
y = "Average popularity") +
theme(legend.position = "none",
axis.text.x = element_text(angle = 90, hjust = 1))
year_graph
We don’t see any discernible pattern here.
I wanted to see whether older songs are more popular than newer ones. I have taken songs released from the 1950s onwards. We see that, barring a significant dip during the decade 2000 - 2010, the mean popularity has not changed much over the years.
trend <- spotify %>%
group_by(year) %>%
summarise(num_songs = n(), rating = sum(track_popularity)/n()) %>%
ungroup() %>%
ggplot(aes(x = year, y = rating, group = 1)) +
geom_line() +
geom_smooth(method = "loess", se = FALSE) +
labs(title = "Rating vs. Year of release", x = "Year of release", y = "Average popularity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplotly(trend)
Who are the people in your life who point you in the direction of new artists, songs and even new styles of music?
I have created a shiny app which collects the user's preferences:
by genre - to list out top songs in the genre based on a customizable rating scale
by artist - to list out top songs of the artist based on a customizable rating scale
The shiny app is hosted here
shinyUI(navbarPage(theme = shinytheme("cosmo"),"Song recommender",
tabPanel("Based on Genre",
sidebarPanel(
# Genre Selection
selectInput(inputId = "Columns", label = "Which genres do you like?",
unique(songs$playlist_genre), multiple = FALSE),
verbatimTextOutput("rock"),
sliderInput(inputId = "range", label = "Range of ratings that you wish to see?",
min = min(songs$track_popularity),max = 100,value = c(55,100))
),
mainPanel(
h2("Top songs of the genre"),
DT::dataTableOutput(outputId = "songsreco")
)
),
tabPanel("Based on Artist",
sidebarPanel(selectInput(inputId = "singers", label = "Which singer do you like?",
unique(songs$track_artist), multiple = FALSE),
verbatimTextOutput("Ed Sheeran"),
sliderInput(inputId = "range_2", label = "Range of ratings that you wish to see?",
min = min(songs$track_popularity),max = 100,value = c(55,100))),
mainPanel(
h2("Top songs of the artist"),
DT::dataTableOutput(outputId = "songsreco_artist")))))
shinyServer(function(input, output) {
datasetInput <- reactive({
# Filtering the songs based on genre and rating
songs %>% filter(playlist_genre %in% as.vector(input$Columns)) %>%
group_by(track_name) %>% filter(track_popularity >= as.numeric(input$range[1]), track_popularity <= as.numeric(input$range[2])) %>%
arrange(desc(track_popularity)) %>%
select(track_name, track_artist, track_popularity, playlist_genre) %>%
rename(`song` = track_name, `Genre(s)` = playlist_genre)
})
datasetInput2 <- reactive({
# Filtering the songs based on artist and rating
songs %>% filter(track_artist %in% as.vector(input$singers)) %>%
group_by(track_name) %>% filter(track_popularity >= as.numeric(input$range_2[1]), track_popularity <= as.numeric(input$range_2[2])) %>%
arrange(desc(track_popularity)) %>%
select(track_name, track_artist, track_popularity, playlist_genre) %>%
rename(`song` = track_name, `Genre(s)` = playlist_genre)
})
#Rendering the table
output$songsreco <- DT::renderDataTable({
DT::datatable(head(datasetInput(), n = 50), escape = FALSE, options = list(scrollX = '1000px'))
})
output$songsreco_artist <- DT::renderDataTable({
DT::datatable(head(datasetInput2(), n = 100), escape = FALSE, options = list(scrollX = '1000px'))
})
})
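The filtering done inside the reactive expressions can be tried outside shiny as a plain function. The sketch below is a base R approximation; recommend_by_genre and the toy songs table are hypothetical names and data introduced only for illustration:

```r
# Hypothetical helper mirroring the genre filter in the server code
recommend_by_genre <- function(songs, genre, rating_range) {
  # Keep songs of the chosen genre whose popularity lies inside the range
  keep <- songs$playlist_genre %in% genre &
    songs$track_popularity >= rating_range[1] &
    songs$track_popularity <= rating_range[2]
  out <- songs[keep, c("track_name", "track_artist",
                       "track_popularity", "playlist_genre")]
  # Most popular songs first
  out[order(-out$track_popularity), ]
}

# Made-up songs table
songs <- data.frame(
  track_name = c("S1", "S2", "S3", "S4"),
  track_artist = c("A", "A", "B", "C"),
  track_popularity = c(90, 40, 70, 95),
  playlist_genre = c("pop", "pop", "pop", "rock"),
  stringsAsFactors = FALSE
)

recs <- recommend_by_genre(songs, "pop", c(55, 100))
print(recs)
```

Keeping the logic in a standalone function like this makes it possible to test the recommendations without launching the app.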
I hope the insights given in the previous sections have been informative. I also hope that you have explored the app and found at least a couple of new songs that you might want to try out.
Given below is a summary of the findings that I have come across during the exploration of this dataset:
EDM is the genre with the most songs, followed by rap and then R&B, which is just ahead of pop.
Pop has the highest median popularity among the genres. EDM has the lowest median popularity.
EDM is the “high energy genre”. Rap is also a popular party genre with 0.73 danceability (top rank). EDM, rock and pop have the highest tempo
EDM and rock can be easily demarcated from the other genres. Latin is poles apart from EDM and rock.
Barring a significant dip in mean popularity during the decade 2000 - 2010, the mean popularity has not changed much over the years.
With ~130 tracks in their name, Queen have been the busiest artists over time. David Guetta comes in second with ~90 tracks
Trevor Daniel, Y2K and Don Toliver are the most popular artists. I have required artists to have more than 2 tracks to their name so as to eliminate “one hit wonders”.
Love is the most frequently used word in track titles. Like, don’t, one etc are the other frequent ones
Predicting track popularity from the given predictors. Prominent factors from an SVM or clustering could be used to identify strong indicators of how a song becomes popular.
A recommendation engine based on the genre selected (being slightly ambitious here)