“I am not an inventor, I just want to make things better” – Daniel Ek (Co-founder & CEO, Spotify)
Everybody knows about it. Spotify has seemingly taken the world by storm over the past few years, recently reaching 345 million users, including 150 million premium subscribers. After its launch in 2008, the product has grown enormously, and it has been amazing to see what it has become today. Its bread and butter is a library of over 40 million songs and a massive number of playlists created both by users and by Spotify’s own algorithms. Using machine learning, artificial intelligence and data-sifting technology, Spotify analyses your listening habits and builds customized recommendations, including playlists and music suggestions based on the genres and artists you listen to regularly.
Why am I interested in this?
I have been an active Spotify user myself since 2019. Being so, I have always been curious: how does Spotify classify songs into such broad genres? What are the features of each genre, and how can the features of a song determine its genre? I decided to analyze the Spotify dataset to better understand the genres, tracks, and artists that consumers have been listening to on Spotify, to analyze the characteristics that affect the popularity of a track, and to find the trend in music popularity over the years.
Objectives:
The main objectives of this analysis are:
Identify the most popular tracks, artists and genres on Spotify.
Identify the characteristics that affect the popularity of a track.
Group tracks based on their characteristics.
Analyze the correlation between various characteristics of a track.
Identify trends in music affinity of listeners over the years.
My methodology is to use the publicly available Spotify dataset. First, the data is cleaned to improve its quality by removing NULL and duplicate values, and then analyzed using various univariate and multivariate techniques to achieve our objectives. I then present the findings in a well explained manner using Exploratory Data Analysis. I also use regression and clustering techniques (a rough sketch of the clustering step is shown below) to better understand the relationships between the various characteristics.
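Since the clustering step is described here but does not get its own code section later, below is a minimal sketch of how grouping tracks by their audio characteristics could look with k-means. The six features and the choice of four clusters are illustrative assumptions only, and the snippet would be run after the data is loaded and cleaned in the following sections.
# a rough k-means sketch of the grouping step (assumptions: these six audio features
# and k = 4 clusters are illustrative choices; run after the cleaning steps below)
audio_features <- spotify %>%
  dplyr::select(danceability, energy, loudness, valence, tempo, acousticness) %>%
  scale()                                    # standardise features before clustering
set.seed(123)
km <- kmeans(audio_features, centers = 4, nstart = 25)
table(km$cluster)                            # how many tracks fall in each cluster
aggregate(as.data.frame(audio_features),
          by = list(cluster = km$cluster), FUN = mean)   # average profile per cluster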
I believe this analysis can help artists understand what their audience is looking for and improve the popularity of their tracks. It can also help music distributors streamline their music libraries. Additionally, it can help the Spotify team target content distribution better by using the results of cluster analysis.
The packages I have used for my analysis are mentioned below:
library(dplyr)
library(tidyverse)
library(ggplot2)
library(DT)
library(knitr)
library(kableExtra)
library(wordcloud)
library(treemap)
library(ggcorrplot)
library(formattable)
library(GGally)
library(purrr)
library(viridis)
library(forcats)
library(corpus)
library(tm)
library(RColorBrewer)
library(cowplot)
library(plotly)
library(nnet)
library(shiny)
library(shinythemes)
library(gridExtra)
The source data comes from Spotify via the spotifyr package, which can be downloaded by clicking here. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package. Its main purpose is to make it easier to obtain general metadata for songs from Spotify’s API. The data contains track details from 1960 to 2020.
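For reference, pulling similar metadata straight from Spotify’s API with spotifyr looks roughly like the sketch below; the credentials are placeholders and “Ed Sheeran” is just an example artist.
# sketch: fetching audio features with spotifyr (requires your own Spotify API credentials)
library(spotifyr)
Sys.setenv(SPOTIFY_CLIENT_ID = "your_client_id")          # placeholder
Sys.setenv(SPOTIFY_CLIENT_SECRET = "your_client_secret")  # placeholder
access_token <- get_spotify_access_token()
# audio features (danceability, energy, valence, ...) for one artist's catalogue
sheeran <- get_artist_audio_features("Ed Sheeran")
head(sheeran[, c("track_name", "danceability", "energy", "valence")])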
As the first step, we import the dataset:
#importing dataset
spotify <- read.csv("C:/Users/arunp/Desktop/UC/ACADEMICS/7025-DATA WRANGLING/MIDTERM PROJECT/spotify_songs.csv")
Let us check the dimensions of our Spotify dataset:
#checking dimensions of the dataset
dim(spotify)
## [1] 32833 23
As we can see, the dataset has 32833 observations and 23 variables. The dataset contains 15 missing values, and these values have not been imputed in the original dataset.
Below is the detailed data dictionary to understand all the variables present in the dataset:
#loading data dictionary
spotify_dict <- read.csv("C:/Users/arunp/Desktop/UC/ACADEMICS/7025-DATA WRANGLING/MIDTERM PROJECT/spotify_dict.csv")
spotify_dict%>% kable() %>% kable_styling(bootstrap_options = c("striped", "condensed", "responsive"), full_width = F)
| variable | class | description |
|---|---|---|
| track_id | character | Song unique ID |
| track_name | character | Song Name |
| track_artist | character | Song Artist |
| track_popularity | double | Song Popularity (0-100) where higher is better |
| track_album_id | character | Album unique ID |
| track_album_name | character | Song album name |
| track_album_release_date | character | Date when album released |
| playlist_name | character | Name of playlist |
| playlist_id | character | Playlist ID |
| playlist_genre | character | Playlist genre |
| playlist_subgenre | character | Playlist subgenre |
| danceability | double | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
| energy | double | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
| key | double | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. |
| loudness | double | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB. |
| mode | double | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. |
| speechiness | double | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
| acousticness | double | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
| instrumentalness | double | Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
| liveness | double | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
| valence | double | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
| tempo | double | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
| duration_ms | double | Duration of song in milliseconds |
We will first have a look at the structure of the dataset:
#analyzing the structure of the dataset
str(spotify)
## 'data.frame': 32833 obs. of 23 variables:
## $ track_id : chr "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
## $ track_name : chr "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : chr "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_popularity : int 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : chr "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
## $ track_album_name : chr "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
## $ track_album_release_date: chr "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
## $ playlist_name : chr "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
## $ playlist_id : chr "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
## $ playlist_genre : chr "pop" "pop" "pop" "pop" ...
## $ playlist_subgenre : chr "dance pop" "dance pop" "dance pop" "dance pop" ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
All the variables of the data are in the required classes, so we do not need to make any changes. Let us now look for the columns containing missing values.
#Identifying missing values across columns
col_miss <- colSums(is.na(spotify))
print(col_miss[col_miss>0])
## track_name track_artist track_album_name
## 5 5 5
As the number of missing values is negligibly small compared to the total number of observations, we can remove these incomplete observations from our data.
#removing missing values
spotify <- na.omit(spotify)
We have to now check for any duplicate observations in our data.
#find number of duplicate values
duplicate_obs <- duplicated(spotify)
print(paste("There are" ,sum(duplicate_obs),"duplicate observations in the data"))
## [1] "There are 0 duplicate observations in the data"
So, all observations in our data are unique. But we can observe some track ids appearing multiple times in the data. Let us check for duplicate track ids.
#check for duplicate track id
duplicate_id <- duplicated(spotify$track_id)
sum(duplicate_id)
## [1] 4476
As can be seen, there are 4476 duplicate track ids. This is due to the same track appearing in playlists of different genres, which can happen since a song can have the characteristics of multiple genres. So, these are not true duplicates, and thus we are not removing them.
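A quick sanity check (sketch) supports this: the duplicated track ids typically span several playlists and, in many cases, more than one genre.
# sketch: for track ids that appear more than once, count distinct playlists and genres
spotify %>%
  group_by(track_id) %>%
  filter(n() > 1) %>%
  summarise(n_playlists = n_distinct(playlist_id),
            n_genres    = n_distinct(playlist_genre)) %>%
  arrange(desc(n_genres)) %>%
  head()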
As we will be exploring the trend in music affinity over the years, we can add a separate column for the release year of each track, extracted from track_album_release_date. Also, as duration in minutes can be a more intuitive measure, we add a column representing duration in minutes.
#Adding two new columns
spotify$release_year <- as.numeric(substring(spotify$track_album_release_date,1,4))
spotify$duration_mnt <- spotify$duration_ms/(1000*60)
A few of the columns like ‘track_id’, ‘track_album_id’ and ‘playlist_id’ will not be needed for analysis because they contain only long alphanumeric values. Let’s get rid of these columns.
#removing unnecessary columns
spotify <- spotify%>%dplyr::select(-track_id,-track_album_id,-playlist_id)
Now, as we have dealt with the choice of variables, let us check the summary of our numerical variables, which are the ones we will mostly need for further analysis. We need to check for any abnormal or outlier values which could adversely affect our analysis.
#checking summary of numerical variables
spotify_num <- spotify %>% keep(is.numeric)
summary(spotify_num)
## track_popularity danceability energy key
## Min. : 0.00 Min. :0.0000 Min. :0.000175 Min. : 0.000
## 1st Qu.: 24.00 1st Qu.:0.5630 1st Qu.:0.581000 1st Qu.: 2.000
## Median : 45.00 Median :0.6720 Median :0.721000 Median : 6.000
## Mean : 42.48 Mean :0.6549 Mean :0.698603 Mean : 5.374
## 3rd Qu.: 62.00 3rd Qu.:0.7610 3rd Qu.:0.840000 3rd Qu.: 9.000
## Max. :100.00 Max. :0.9830 Max. :1.000000 Max. :11.000
## loudness mode speechiness acousticness
## Min. :-46.448 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.: -8.171 1st Qu.:0.0000 1st Qu.:0.0410 1st Qu.:0.0151
## Median : -6.166 Median :1.0000 Median :0.0625 Median :0.0804
## Mean : -6.720 Mean :0.5657 Mean :0.1071 Mean :0.1754
## 3rd Qu.: -4.645 3rd Qu.:1.0000 3rd Qu.:0.1320 3rd Qu.:0.2550
## Max. : 1.275 Max. :1.0000 Max. :0.9180 Max. :0.9940
## instrumentalness liveness valence tempo
## Min. :0.0000000 Min. :0.0000 Min. :0.0000 Min. : 0.00
## 1st Qu.:0.0000000 1st Qu.:0.0927 1st Qu.:0.3310 1st Qu.: 99.96
## Median :0.0000161 Median :0.1270 Median :0.5120 Median :121.98
## Mean :0.0847599 Mean :0.1902 Mean :0.5106 Mean :120.88
## 3rd Qu.:0.0048300 3rd Qu.:0.2480 3rd Qu.:0.6930 3rd Qu.:133.92
## Max. :0.9940000 Max. :0.9960 Max. :0.9910 Max. :239.44
## duration_ms release_year duration_mnt
## Min. : 4000 Min. :1957 Min. :0.06667
## 1st Qu.:187805 1st Qu.:2008 1st Qu.:3.13008
## Median :216000 Median :2016 Median :3.60000
## Mean :225797 Mean :2011 Mean :3.76328
## 3rd Qu.:253581 3rd Qu.:2019 3rd Qu.:4.22635
## Max. :517810 Max. :2020 Max. :8.63017
We can observe some outlier values in some of the variables from the summary of the dataset. For a better understanding, we will plot the boxplots of these concerned variables.
#plotting boxplots of numeric variables
par(mfrow = c(2,5), oma = c(1,1,0,0) + 0.1, mar = c(3,3,1,1) + 0.1)
attach(spotify)
boxplot(danceability, col = "turquoise", pch = 19)
mtext("danceability", cex = 0.8, side = 1, line = 2)
boxplot(energy, col = "turquoise", pch = 19)
mtext("energy", cex = 0.8, side = 1, line = 2 )
boxplot(key, col = "turquoise", pch = 19)
mtext("key", cex = 0.8, side = 1, line = 2)
boxplot(loudness, col = "turquoise", pch = 19)
mtext("loudness", cex = 0.8, side = 1, line = 2)
boxplot(speechiness, col = "turquoise", pch = 19)
mtext("speechiness", cex = 0.8, side = 1, line = 2)
boxplot(acousticness, col = "turquoise", pch = 19)
mtext("acousticness", cex = 0.8, side = 1, line = 2)
boxplot(instrumentalness, col = "turquoise", pch = 19)
mtext("instrumentalness", cex = 0.8, side = 1, line = 2)
boxplot(liveness, col = "turquoise", pch = 19)
mtext("liveness", cex = 0.8, side = 1, line = 2)
boxplot(valence, col = "turquoise", pch = 19)
mtext("valence", cex = 0.8, side = 1, line = 2)
boxplot(tempo, col = "turquoise", pch = 19)
mtext("tempo", cex = 0.8, side = 1, line = 2)
Observing the boxplots, we can note a few variables that warrant a closer look:
Distribution of Loudness variable
To investigate the outlier values in this variable, let us look at the distribution of the characteristic across different genres.
#boxplot of loudness across various genres
spotify %>%
ggplot( aes(x=loudness, y=playlist_genre, fill=loudness)) +
geom_boxplot(fill = "#4271AE") +
xlab("loudness") +
theme(legend.position="none")+
ggtitle("Loudness Distribution across Genres")
As we can see from the boxplot, the genre ‘latin’ does have relatively more outliers on the left side than the other genres. Therefore, this minimum value might be genuine, as latin tracks appear to have a lower loudness characteristic. Besides, the data dictionary says loudness values typically range between -60 dB and 0 dB, so the minimum value of -46.448 is acceptable. Hence, we are not removing these outliers.
Distribution of Instrumentalness
The mean and the median of this variable differ by a large extent. We will plot a histogram of the variable to understand more about this abnormality in the distribution.
#histogram of instrumentalness
spotify %>%
ggplot(aes(x=instrumentalness))+
geom_histogram(binwidth = 0.1, bins = 10,fill = "turquoise")+
ggtitle("Histogram of Instrumentalness")
We can see that most of the observations have a value close to 0, and this is what is causing the large difference between the mean and the median. But we cannot conclude that these values are outliers, as this may simply be a characteristic of the tracks. Also, since the number of observations with such values is quite high, we are not amending any of these values, as doing so could affect our model.
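As a rough check of how concentrated the distribution is near zero (the 0.05 cutoff here is an arbitrary choice for illustration):
# share of tracks with near-zero instrumentalness (0.05 cutoff chosen arbitrarily)
mean(spotify$instrumentalness < 0.05)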
Removing outliers from Tempo
As the minimum value of 0 for the “tempo” characteristic does not make any sense, it looks like an outlier. We will remove these observations.
#removing outliers
spotify <- spotify[-which(spotify$tempo == min(spotify$tempo)),]
Our dataset has been cleaned and is now ready for Exploratory Data Analysis and further prediction modelling. We will have a look at the dimensions of our cleaned data.
#checking dimensions of dataset
print(paste("There are ",dim(spotify)[1],"observations and",dim(spotify)[2],"columns in our cleaned dataset"))
## [1] "There are 32827 observations and 22 columns in our cleaned dataset"
Let us see how our dataset looks now.
#showing dataset
datatable(head(spotify,100),
class = 'row-border stripe hover compact',
rownames = F,
autoHideNavigation = T, escape =FALSE)
The distribution of our dataset across the various genres is as follows:
#displaying frequencies across genres
kable(spotify %>%
group_by(playlist_genre) %>%
summarise(total = n())) %>%
kable_material(c("striped", "hover"))
| playlist_genre | total |
|---|---|
| edm | 6043 |
| latin | 5153 |
| pop | 5507 |
| r&b | 5431 |
| rap | 5743 |
| rock | 4950 |
The key numerical attributes of the cleaned dataset are summarised below:
| Statistic | track_popularity | danceability | energy | key | loudness | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_mnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min | 0.00 | 0.0000 | 0.0002 | 0.000 | -46.448 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.00 | 0.0667 |
| Max | 100.00 | 0.9830 | 1.0000 | 11.000 | 1.275 | 0.9180 | 0.9940 | 0.9940 | 0.9960 | 0.9910 | 239.44 | 8.6302 |
| Median | 45.00 | 0.6720 | 0.7210 | 6.000 | -6.166 | 0.0625 | 0.0804 | 0.0000 | 0.1270 | 0.5120 | 121.98 | 3.6000 |
| Mean | 42.48 | 0.6549 | 0.6986 | 5.374 | -6.720 | 0.1071 | 0.1754 | 0.0848 | 0.1902 | 0.5106 | 120.88 | 3.7633 |
“This is my favorite part about analytics: Taking boring flat data and bringing it to life through visualization” - John Tukey
Now, as we have a clean and well defined dataset, let us perform various visualizations to gain more insights about the data.
Let us start with the popularity analysis. For the purpose of this study, I plan to classify the track popularity attribute into classes of low, medium and high popularity. As the dictionary mentions, track popularity is a value between 0 and 100. I am classifying the groups as follows:
spotify <- spotify %>%
mutate(popularity = case_when(track_popularity <= 30 ~ "low",
track_popularity > 30 & track_popularity <= 75 ~ "medium",
track_popularity > 75 ~ "high"))
Popularity can be defined with respect to tracks and artists. We can now see which are the popular tracks as well as popular artists.
Top tracks
Let us find out the top tracks in the dataset:
popular_track <- spotify %>%
filter(track_popularity >= 75) %>%
arrange(desc(track_popularity)) %>%
distinct(track_name, track_popularity)
datatable(
head(popular_track,10),
extensions = 'FixedColumns',
options = list(
scrollY = "400px",
scrollX = TRUE,
fixedColumns = TRUE
)
)
As we can see, “Dance Monkey” is the most popular song in our dataset. It is also the only song with a popularity of 100. Try checking your favourite song in the list.
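If you want to check whether a particular title made the cut, a simple filter on the table works; the song name below is just a placeholder.
# sketch: look up a (hypothetical) favourite song among the popular tracks
popular_track %>%
  filter(grepl("Blinding Lights", track_name, ignore.case = TRUE))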
Artists with the most popular songs
I am also interested to know which artist has the most popular songs to their name. In particular, I am excited to see whether ‘Drake’, who is my current favourite, appears in the top artists list.
top_artist <-
spotify %>%
dplyr::select(track_artist,track_popularity,popularity) %>%
filter(popularity == "high") %>%
arrange(desc(track_popularity)) %>%
count(track_artist) %>%
arrange(-n) %>%
head(10)
top_artist %>%
ggplot(aes(reorder(track_artist, n), n)) +
geom_col(fill = "#f68060") +
coord_flip() +
labs(x = 'Artist', y = 'No: of songs', title = 'Top 10 Popular Artists') +
theme(plot.title = element_text(hjust = 0.5),legend.position = 'bottom') +
geom_text(aes(label = n), nudge_y = 1)
Naah! He is not there in the top 10 list. ‘Ed Sheeran’ is the artist with the most popular songs (39) to his name. Who are your favourite artists? Do they feature in the list?
Top artists by genre
The top artists list features many edm artists. This may be due to the high popularity of edm songs. So, what about the artists who create songs in other genres? We will try to find out who the top artists are in each genre, using a tree map.
artist_genre <- spotify %>% dplyr::select(playlist_genre,track_artist,track_popularity) %>% group_by(playlist_genre,track_artist) %>% summarise(n = n()) %>% top_n(10, n)
tm <- treemap(artist_genre, index = c("playlist_genre", "track_artist"), vSize = "n", vColor = 'playlist_genre', palette = viridis(6),title="Top 10 Track Artists within each Playlist Genre")
I am happy to find Drake’s name in the top list of both r&b and rap.
Most common words in popular song titles
I thought it would be interesting to look at a slightly less explored place: the title of the track. Just by looking at the title alone, could we pick up some traction on why songs succeed or fail? I am not sure. But I would like to see the common words used in the titles of popular songs. For this analysis, I am considering the tracks with high and medium popularity and creating a word cloud.
# Create a vector containing only the text
songs_popular <- spotify %>%
dplyr::select(track_name,popularity) %>%
filter(popularity == "high"|popularity == "medium")
text <- songs_popular$track_name
# Create a corpus
docs <- Corpus(VectorSource(text))
#clean text data
docs <- docs %>%
tm_map(removeNumbers) %>%
tm_map(removePunctuation) %>%
tm_map(stripWhitespace)
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removeWords,c("feat","edit","remix","remastered","remaster","radio","version","original","mix","edm","rock","latin","pop","rap","r&b","music","tãº"))
#create a document-term matrix
dtm <- TermDocumentMatrix(docs)
matrix <- as.matrix(dtm)
words <- sort(rowSums(matrix),decreasing=TRUE)
df <- data.frame(word = names(words),freq=words)
#generate the word cloud
set.seed(101)
wordcloud(words = df$word, freq = df$freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
We can see that the word ‘Love’ is the most commonly appearing word in the titles of popular songs. We can also notice some other common words such as ‘Like’, ‘Dont’, ‘One’, etc.
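The frequencies behind the word cloud can also be inspected directly, for example:
# top 10 most frequent words in popular song titles
head(df, 10)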
Though the structure of each song is in some way unique, there are definitely some common threads happening. Let us check for the correlation between various attributes of a song.
#selecting track_popularity and the numeric track attributes (column indices of the cleaned data)
attributes <- spotify[c(3,9:12,14:19,22)]
att_cor <- attributes %>% cor() %>%
ggcorrplot(type = "lower", hc.order = TRUE, colors = brewer.pal(n = 3, name = "RdYlBu"))
att_cor
From the correlation plot, we can observe that:
There exists a high positive correlation between energy and loudness.
There exists a high negative correlation between energy and acousticness.
There are moderate correlations between loudness and acousticness, and between valence and danceability.
We can also observe that speechiness, tempo and key have no strong correlation with track popularity. Thus, we can conclude that popularity is influenced mainly by the remaining characteristics: acousticness, loudness, valence, danceability, liveness, energy and instrumentalness.
This study can be helpful to us when we try to build a predictive model.
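For a quick numeric view behind these observations, the raw correlations of each numeric attribute with track_popularity can be listed directly (a small sketch, not part of the plot above):
# sketch: correlations of the numeric attributes with track_popularity, strongest first
pop_cor <- cor(spotify %>% keep(is.numeric))[, "track_popularity"]
round(sort(pop_cor, decreasing = TRUE), 3)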
Music trends change every day, and it got me thinking: what genres of music define each decade? And how have the characteristics of music changed over these periods?
To find out, I am classifying the release years of songs into the corresponding decades (named release_era in the dataset) and trying to visualize them.
spotify$release_era <- ifelse(spotify$release_year < 1970 , "1960's",
ifelse( spotify$release_year < 1980, "1970's",
ifelse( spotify$release_year < 1990, "1980's",
ifelse( spotify$release_year < 2000,"1990's",
ifelse( spotify$release_year < 2010, "2000's", "2010's")))))
Now, let us see which genres are the most popular during each decade.
trend <- spotify %>% select (release_era , playlist_genre,track_popularity) %>%
group_by (release_era ,playlist_genre) %>%
summarise(rating = mean(track_popularity))
trend_plot <- trend %>%
plot_ly(
type = 'bar',
x = trend$playlist_genre,
y = trend$rating,
hoverinfo = 'text',
mode = 'markers',
transforms = list(
list(
type = 'filter',
target = ~release_era,
groups = trend$playlist_genre,
operation = '=',
value = unique(spotify$release_era)[1]
)
)) %>% layout(
updatemenus = list(
list(
type = 'dropdown',
active = 0,
buttons = list(
list(method = "restyle",
args = list("transforms[0].value", unique(spotify$release_era)[1]),
label = unique(spotify$release_era)[1]),
list(method = "restyle",
args = list("transforms[0].value", unique(spotify$release_era)[2]),
label = unique(spotify$release_era)[2]),
list(method = "restyle",
args = list("transforms[0].value", unique(spotify$release_era)[3]),
label = unique(spotify$release_era)[3]),
list(method = "restyle",
args = list("transforms[0].value", unique(spotify$release_era)[4]),
label = unique(spotify$release_era)[4]),
list(method = "restyle",
args = list("transforms[0].value", unique(spotify$release_era)[5]),
label = unique(spotify$release_era)[5]),
list(method = "restyle",
args = list("transforms[0].value", unique(spotify$release_era)[6]),
label = unique(spotify$release_era)[6])
)
)
)
)
trend_plot
Music in the 21st Century
If the music biz could strike a pose for the ‘10-year challenge’ (the social-media craze comparing selfies from the start and end of this decade), then its glossy 2009 shot would surely be upstaged by a more impulsive, increasingly worldly 2019 vision. The 2010s have actually been quite the transformative decade. Not only have the faces of music changed, but the hierarchy of music genres has also been rearranged. Let us try to find the changes in various song attributes during this period.
trend <- spotify %>% group_by(release_year) %>% filter(release_year >2010) %>% summarise(popularity_avg = mean(track_popularity),danceability_avg = mean(danceability),energy_avg = mean(energy),loudness_avg = mean(loudness),duration_avg = mean(duration_mnt),speechiness_avg = mean(speechiness))
t1 <- ggplot(trend,aes(x = release_year,y = popularity_avg))+
geom_line(color = "#00AFBB", size = 1)+
scale_x_continuous(breaks=seq(2011, 2020, 1))
t2 <- ggplot(trend,aes(x = release_year,y = danceability_avg))+
geom_line(color = "#00AFBB", size = 1)+
scale_x_continuous(breaks=seq(2011, 2020, 1))
t3 <- ggplot(trend,aes(x = release_year,y = energy_avg))+
geom_line(color = "#00AFBB", size = 1)+
scale_x_continuous(breaks=seq(2011, 2020, 1))
t4 <- ggplot(trend,aes(x = release_year,y = loudness_avg))+
geom_line(color = "#00AFBB", size = 1)+
scale_x_continuous(breaks=seq(2011, 2020, 1))
t5 <- ggplot(trend,aes(x = release_year,y = duration_avg))+
geom_line(color = "#00AFBB", size = 1)+
scale_x_continuous(breaks=seq(2011, 2020, 1))
t6 <- ggplot(trend,aes(x = release_year,y = speechiness_avg))+
geom_line(color = "#00AFBB", size = 1)+
scale_x_continuous(breaks=seq(2011, 2020, 1))
grid.arrange(t1,t2,t3,t4,t5,t6,ncol = 2)
The plots above summarise how the average popularity, danceability, energy, loudness, duration and speechiness of tracks have shifted between 2011 and 2020.
One of Spotify’s most popular features is its Discover Weekly playlist, which is generated each week based on a user’s listening habits. As a Spotify user, I have found these playlists to be extremely accurate and useful. I wanted to try building a basic version of it: a song recommendation engine based on genre, artist, mood and era.
Here is the code snippet for the R Shiny app - Song Recommendation Engine:
#deriving a simple mood label from the audio features (used by the "Based on Mood" tab)
spotify$mood <- ifelse(spotify$danceability >= median(spotify$danceability),"Party/Dance",
ifelse(spotify$energy >= median(spotify$energy),"Gym",
ifelse(spotify$valence >= median(spotify$valence),"Cheerful","Others")))
shinyUI(navbarPage(theme = shinytheme("superhero"),"Song recommender",
tabPanel("Based on Genre",
sidebarPanel(
# Genre Selection
selectInput(inputId = "Columns", label = "Which genres do you like?",
unique(spotify$playlist_genre), multiple = FALSE),
verbatimTextOutput("rock"),
sliderInput(inputId = "range", label = "Range of Ratings that you wish to listen?",
min = min(spotify$track_popularity),max = 100,value = c(50,100))
),
mainPanel(
h2("Top songs of the genre"),
DT::dataTableOutput(outputId = "songsreco")
)
),
tabPanel("Based on Artist",
sidebarPanel(selectInput(inputId = "singers", label = "Which singer do you like?",
unique(spotify$track_artist), multiple = FALSE),
verbatimTextOutput("Ed Sheeran"),
sliderInput(inputId = "range_2", label = "Range of Ratings that you wish to listen?",
min = min(spotify$track_popularity),max = 100,value = c(50,100))),
mainPanel(
h2("Top songs of the artist"),
DT::dataTableOutput(outputId = "songsreco_artist"))),
tabPanel("Based on Mood",
sidebarPanel(selectInput(inputId = "Mood", label = "Which mood songs do you like to listen?",
unique(spotify$mood), multiple = FALSE),
verbatimTextOutput("Party/Dance"),
sliderInput(inputId = "range_4", label = "Range of Ratings that you wish to listen?",
min = min(spotify$track_popularity),max = 100,value = c(50,100))),
mainPanel(
h2("Top songs of the mood"),
DT::dataTableOutput(outputId = "songsreco_mood"))),
tabPanel("Based on Era",
sidebarPanel(
# Genre Selection
selectInput(inputId = "Era", label = "Which era song do you like to listen?",
unique(spotify$release_era), multiple = FALSE),
verbatimTextOutput("2010's"),
sliderInput(inputId = "range_3", label = "Range of Ratings that you wish to listen?",
min = min(spotify$track_popularity),max = 100,value = c(50,100))
),
mainPanel(
h2("Top songs of the Era"),
DT::dataTableOutput(outputId = "songsreco_era")
)
)
))
shinyServer(function(input, output) {
datasetInput <- reactive({
# Filtering the tracks based on genre and rating
spotify %>% filter(playlist_genre %in% as.vector(input$Columns)) %>%
group_by(track_name) %>% filter(track_popularity >= as.numeric(input$range[1]), track_popularity <= as.numeric(input$range[2])) %>%
arrange(desc(track_popularity)) %>%
select(track_name, track_artist, track_popularity, playlist_genre) %>%
rename(`song` = track_name, `Genre(s)` = playlist_genre)
})
datasetInput2 <- reactive({
# Filtering the tracks based on artist and rating
spotify %>% filter(track_artist %in% as.vector(input$singers)) %>%
group_by(track_name) %>% filter(track_popularity >= as.numeric(input$range_2[1]), track_popularity <= as.numeric(input$range_2[2])) %>%
arrange(desc(track_popularity)) %>%
select(track_name, track_artist, track_popularity, playlist_genre) %>%
rename(`song` = track_name, `Genre(s)` = playlist_genre)
})
datasetInput3 <- reactive({
# Filtering the tracks based on era and rating
spotify %>% filter(release_era %in% as.vector(input$Era)) %>%
group_by(track_name) %>% filter(track_popularity >= as.numeric(input$range_3[1]), track_popularity <= as.numeric(input$range_3[2])) %>%
arrange(desc(track_popularity)) %>%
select(track_name, track_artist, track_popularity, playlist_genre) %>%
rename(`song` = track_name, `Genre(s)` = playlist_genre)
})
datasetInput4 <- reactive({
# Filtering the tracks based on mood and rating
spotify %>% filter(mood %in% as.vector(input$Mood)) %>%
group_by(track_name) %>% filter(track_popularity >= as.numeric(input$range_4[1]), track_popularity <= as.numeric(input$range_4[2])) %>%
arrange(desc(track_popularity)) %>%
select(track_name, track_artist, track_popularity, playlist_genre) %>%
rename(`song` = track_name, `Genre(s)` = playlist_genre)
})
#Rendering the table
output$songsreco <- DT::renderDataTable({
DT::datatable(head(datasetInput(), n = 50), escape = FALSE, options = list(scrollX = '1000px'))
})
output$songsreco_artist <- DT::renderDataTable({
DT::datatable(head(datasetInput2(), n = 100), escape = FALSE, options = list(scrollX = '1000px'))
})
output$songsreco_era <- DT::renderDataTable({
DT::datatable(head(datasetInput3(), n = 50), escape = FALSE, options = list(scrollX = '1000px'))
})
output$songsreco_mood <- DT::renderDataTable({
DT::datatable(head(datasetInput4(), n = 50), escape = FALSE, options = list(scrollX = '1000px'))
})
})
In this section, I am trying to come up with a model that can predict the popularity of a song given its other attributes. More particularly, the model helps predict which popularity class (low, medium or high) a song falls into based on those attributes.
Multinomial Logistic Regression
We can make use of multinomial logistic regression, as there are three different popularity classes in our response variable. We have seen from the correlation plot during our exploratory data analysis that track popularity is correlated with acousticness, loudness, valence, danceability, liveness, energy and instrumentalness. We therefore build the model by fitting the popularity class against a subset of these attributes (danceability, energy, loudness, acousticness and instrumentalness) along with track duration. The first step is to randomly split the whole dataset into a training (70%) and a testing (30%) set for model validation. I train the model with the training set and then test its predictive capability using the testing set.
#selecting the modelling variables: danceability, energy, loudness, acousticness, instrumentalness, duration_mnt and the popularity class
spotify_train <- spotify[c(9:10,12,15:16,22:23)]
set.seed(123)
train_idx <- sample(nrow(spotify_train), .70*nrow(spotify_train))
train <- spotify_train[train_idx,]
test <- spotify_train[-train_idx,]
Now, let us perform the model fitting and analysis. When we build logistic models, we need to set one of the levels of the dependent variable as a baseline; we achieve this by using the relevel() function.
# Setting the baseline
train$popularity <- relevel(factor(train$popularity), ref = "low")
Once the baseline has been specified, we use multinom() function to fit the model and then use summary() function to explore the beta coefficients of the model.
# Training the multinomial model
multinom.fit <- multinom( popularity ~ . -1, data = train)
## # weights: 21 (12 variable)
## initial value 25243.913169
## iter 10 value 19314.466086
## iter 20 value 19187.485171
## iter 20 value 19187.485166
## iter 20 value 19187.485166
## final value 19187.485166
## converged
# Checking the model
summary(multinom.fit)
## Call:
## multinom(formula = popularity ~ . - 1, data = train)
##
## Coefficients:
## danceability energy loudness acousticness instrumentalness
## high 2.906834 -2.224564 0.18568060 1.2284163 -3.5823068
## medium 1.048789 0.496183 -0.02053985 0.9927426 -0.6678725
## duration_mnt
## high -0.1773257
## medium -0.1558312
##
## Std. Errors:
## danceability energy loudness acousticness instrumentalness
## high 0.16496379 0.14760476 0.011879543 0.13411631 0.32585038
## medium 0.08555869 0.07401473 0.005591354 0.07817619 0.06223128
## duration_mnt
## high 0.02961251
## medium 0.01399996
##
## Residual Deviance: 38374.97
## AIC: 38398.97
The output of summary() contains a table of coefficients and a table of standard errors. Each row in the coefficient table corresponds to a model equation. The ratio of the probability of choosing one of the other popularity classes over the baseline class “low” is referred to as the relative risk (often described as odds). However, the output of the model is the log of the odds, so to get the relative risk, i.e. the odds ratio, we need to exponentiate the coefficients.
# extracting coefficients from the model and exponentiate
exp(coef(multinom.fit))
## danceability energy loudness acousticness instrumentalness
## high 18.298773 0.1081146 1.2040376 3.415816 0.02781147
## medium 2.854193 1.6424402 0.9796697 2.698626 0.51279838
## duration_mnt
## high 0.8375070
## medium 0.8557036
The relative risk ratios for a one-unit increase in each variable, for being in the high or medium popularity class vs. the low popularity class, are shown in the output above. A value of 1 represents no change, a value greater than 1 represents an increase, and a value less than 1 represents a decrease. We can also use fitted probabilities to understand our model.
head(probability.table <- fitted(multinom.fit))
## low high medium
## 2986 0.3086634 0.07597690 0.6153597
## 29931 0.4320298 0.01013535 0.5578348
## 29716 0.3413449 0.07568651 0.5829686
## 2757 0.3950219 0.02060377 0.5843743
## 9645 0.3331585 0.06574225 0.6010993
## 31319 0.2981284 0.08629185 0.6155798
The table above indicates that the probability of the 2986th observation being of medium popularity is 61.5%, of low popularity is 30.9%, and of high popularity is 7.6%. Thus we can conclude that the 2986th observation is of medium popularity. On a similar note, the 29931st and 29716th observations are also of medium popularity, and so on. We will now check the model accuracy by building a classification table. Let us first build the classification table for the training dataset and calculate the model accuracy.
# Predicting the values for train dataset
train$predicted <- predict(multinom.fit, newdata = train, "class")
# Building classification table
ctable <- table(train$popularity, train$predicted)
# Calculating accuracy - sum of diagonal elements divided by total obs
round((sum(diag(ctable))/sum(ctable))*100,2)
## [1] 61.8
Accuracy on the training dataset is 61.8%. We now repeat the above steps on the testing dataset.
# Predicting the values for the test dataset
test$predicted <- predict(multinom.fit, newdata = test, "class")
# Building classification table
ctable <- table(test$popularity, test$predicted)
# Calculating accuracy - sum of diagonal elements divided by total obs
round((sum(diag(ctable))/sum(ctable))*100,2)
## [1] 60
We were able to build a model that predicts the popularity class with about 60% accuracy.
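To put the 60% in context, one could compare it with a naive baseline that always predicts the most common popularity class in the test set (a quick sketch, not part of the original analysis):
# sketch: accuracy of always predicting the majority popularity class in the test set
round(100 * max(table(test$popularity)) / nrow(test), 2)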
Problem statement
The main objective of our study was to find out how the attributes of a song affect its popularity. We also carried out an analysis of the popularity of songs, the top artists, the top tracks, and the general trend in music affinity over the decades.
Methodology used for Analysis
Insights from the analysis
By performing the analysis described in our methodology, we came up with a number of interesting findings, which are summarised in the sections above.
Implications
Limitations