Analyzing trends in music popularity using Spotify Dataset

INTRODUCTION

Introduction

“I am not an inventor, I just want to make things better” - Daniel Ek (Co-founder & CEO, Spotify)

Everybody knows about it. Spotify has seemingly taken the world by storm over the past few years, recently reaching 345 million users, including 150 million premium subscribers. Since its launch in 2008, the product has grown enormously, and it has been amazing to see what it has become today. Its bread and butter is a library of millions of songs (over 40 million) and a massive number of playlists created by both mobile app users and Spotify’s own algorithms. Using machine learning, artificial intelligence and data-sifting technology, Spotify analyses your listening habits and builds customized recommendations. These include playlists and music suggestions based on the genres and artists you listen to regularly.

Problem Statement

Why am I interested in this?

I have been an active Spotify user myself since 2019. As such, I have always been curious about how Spotify classifies songs into such broad genres. What are the features of each genre, and how can the features of a song determine its genre? I have decided to analyze the Spotify dataset to gain a greater understanding of the genres, tracks, and artists that consumers have been listening to on Spotify. I also want to analyze the characteristics that affect the popularity of a track and to find the trend in music popularity over the years.

Objectives:

The main objectives of this analysis are:

Identify the most popular tracks, artists and genres on Spotify.
Identify characteristics that affect the popularity of a track.
Group tracks based on their characteristics.
Analyze the correlation between various characteristics of a track.
Identify trends in music affinity of listeners over the years.

Methodology

My methodology is to use the publicly available Spotify dataset. First, the data will be cleaned for better quality by removing NULL and duplicate values; it will then be analyzed using various univariate and multivariate techniques to achieve our objectives. I will then present the findings in a well-explained manner using Exploratory Data Analysis. I will also use regression and clustering techniques to better understand the relationship between various characteristics.

Consumers of the analysis

I believe this analysis can help artists understand what their audience is looking for and help them improve the popularity of their tracks. It can also help music distributors streamline their music library. Additionally, it can help the Spotify team target content distribution better by drawing on the clusters identified in the analysis.

PACKAGES REQUIRED

The packages I have used for my analysis are mentioned below:

dplyr : Data manipulation using filter, joins, summarise etc
tidyverse : Allows data manipulation
ggplot2 : Used to create visualizations
DT : Filtering, pagination, and sorting of data tables in html outputs
knitr : Aligned displays of tables in an html doc
kableExtra : Manipulate table styles for good visualizations
wordcloud : Generates wordclouds
treemap : Allows tree map visualizations
ggcorrplot : Used for visualizing correlation matrices
formattable : Used to format table outputs
GGally : Allows in-depth EDA, works in synchrony with ggplot
purrr : Provides a complete and consistent set of tools for working with functions and vectors.
viridis : Use the color scales in this package to make plots that are pretty, better represent your data, easier to read
forcats : Provide a suite of tools that solve common problems with factors, including changing the order of levels or the values
corpus : Representing and computing on corpora. Required for generating wordcloud
tm : A framework for text mining applications within R. Used here for generating wordcloud
RColorBrewer: Provides color schemes for maps (and other graphics)
cowplot : Provides various features that help with creating publication-quality figures
plotly : For creating interactive web-based graphs
nnet : Software for multinomial log-linear models
shiny : To build interactive web apps straight from R
shinythemes : Themes for use with Shiny app
gridExtra : Provides a number of user-level functions to work with “grid” graphics. Used here to arrange multiple grid-based plots on a page.

library(dplyr) 
library(tidyverse)
library(ggplot2)
library(DT)
library(knitr)
library(kableExtra)
library(wordcloud)
library(treemap)
library(ggcorrplot)
library(formattable)
library(GGally)
library(purrr)
library(viridis)
library(forcats)
library(corpus)
library(tm)
library(RColorBrewer)
library(cowplot)
library(plotly) 
library(nnet)
library(shiny)
library(shinythemes)
library(gridExtra)

DATA PREPARATION

Data Source

The source data comes from Spotify via the spotifyr package, authored by Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff. The main purpose of the package is to make it easier to obtain general metadata for songs from Spotify’s API. The data contains tracks with release years from 1957 to 2020.

As a first step, we import the dataset:

#importing dataset
spotify <- read.csv("C:/Users/arunp/Desktop/UC/ACADEMICS/7025-DATA WRANGLING/MIDTERM PROJECT/spotify_songs.csv")

Let us check for the dimensions of our spotify dataset:

#checking dimensions of the dataset
dim(spotify)

## [1] 32833    23

As we can see, the dataset has 32833 observations and 23 variables. It contains 15 missing values, which have not been imputed in the original dataset.

Below is the detailed data dictionary to understand all the variables present in the dataset:

#loading data dictionary
spotify_dict <- read.csv("C:/Users/arunp/Desktop/UC/ACADEMICS/7025-DATA WRANGLING/MIDTERM PROJECT/spotify_dict.csv")
spotify_dict%>% kable() %>% kable_styling(bootstrap_options = c("striped", "condensed", "responsive"), full_width = F)

variable	class	description
track_id	character	Song unique ID
track_name	character	Song Name
track_artist	character	Song Artist
track_popularity	double	Song Popularity (0-100) where higher is better
track_album_id	character	Album unique ID
track_album_name	character	Song album name
track_album_release_date	character	Date when album released
playlist_name	character	Name of playlist
playlist_id	character	Playlist ID
playlist_genre	character	Playlist genre
playlist_subgenre	character	Playlist subgenre
danceability	double	Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy	double	Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
key	double	The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
loudness	double	The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
mode	double	Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness	double	Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
acousticness	double	A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
instrumentalness	double	Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness	double	Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence	double	A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo	double	The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
duration_ms	double	Duration of song in milliseconds

Data Cleaning

We would first have a look at the structure of the dataset:

#analyzing the structure of the dataset
str(spotify)

## 'data.frame':    32833 obs. of  23 variables:
##  $ track_id                : chr  "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
##  $ track_name              : chr  "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr  "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : int  66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : chr  "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
##  $ track_album_name        : chr  "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr  "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
##  $ playlist_name           : chr  "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_id             : chr  "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
##  $ playlist_genre          : chr  "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr  "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : int  6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : int  1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num  122 100 124 122 124 ...
##  $ duration_ms             : int  194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...

All the variables of the data are in the required classes, so we do not need to make any changes. Let us now look for the columns containing missing values.

#Identifying missing values across columns
col_miss <- colSums(is.na(spotify))
print(col_miss[col_miss>0])

##       track_name     track_artist track_album_name 
##                5                5                5

As the number of missing values is negligibly small compared to the total number of observations, we can remove these incomplete observations from our data.

#removing missing values
spotify <-  na.omit(spotify)

We have to now check for any duplicate observations in our data.

#find number of duplicate values
duplicate_obs <- duplicated(spotify)
print(paste("There are" ,sum(duplicate_obs),"duplicate observations in the data"))

## [1] "There are 0 duplicate observations in the data"

So, all observations in our data are unique. But we can observe some track ids appearing multiple times in the data. Let us check for duplicate track ids.

#check for duplicate track id
duplicate_id <- duplicated(spotify$track_id)
sum(duplicate_id)

## [1] 4476

As can be seen, there are 4476 duplicate track ids. This is due to the same track id being featured in different genres. This can happen as a song can have multiple genre characteristics. So, these are not true duplicate values and thus we are not removing these duplicate values.
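One way to check this explanation is to count how many distinct genres each repeated track id appears under. The snippet below is a minimal sketch on a small made-up data frame (the `toy` values are hypothetical, not taken from the Spotify data); it uses base R only so it runs without extra packages.

```r
# Toy data (hypothetical values) standing in for the spotify data frame
toy <- data.frame(
  track_id       = c("a1", "a1", "b2", "c3", "c3", "c3"),
  playlist_genre = c("pop", "edm", "rock", "rap", "rap", "r&b"),
  stringsAsFactors = FALSE
)

# Rows whose track_id has already appeared earlier in the data
n_dup_ids <- sum(duplicated(toy$track_id))

# Number of distinct genres each track_id appears under
genres_per_track <- tapply(toy$playlist_genre, toy$track_id,
                           function(g) length(unique(g)))

# Track ids that appear under more than one genre
multi_genre_ids <- names(genres_per_track)[genres_per_track > 1]

print(n_dup_ids)
print(genres_per_track)
print(multi_genre_ids)
```

In this toy example, track `c3` also appears twice within the same genre, which would correspond to a track sitting in two playlists of one genre. Running the same check on the real data would show how many of the repeated ids fall into each case.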

As we will be exploring the trend in music affinity over the years, we can add a separate column for the release year of each track. This can be extracted from track_album_release_date. Also, as the duration of a song in minutes is easier to interpret than milliseconds, we add a column representing duration in minutes.

#Adding two new columns 
spotify$release_year <- as.numeric(substring(spotify$track_album_release_date,1,4))
spotify$duration_mnt <- spotify$duration_ms/(1000*60)
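To make the two derived columns concrete, here is a small worked sketch on made-up rows. It assumes, as an illustration only, that some release dates may be stored with less precision than a full date (for example year only); taking the first four characters works in either case.

```r
# Toy rows (hypothetical values); the date formats are an assumption
toy <- data.frame(
  track_album_release_date = c("2019-06-14", "1998", "2005-11"),
  duration_ms              = c(194754, 240000, 180000),
  stringsAsFactors = FALSE
)

# First four characters of the date string give the year
toy$release_year <- as.numeric(substring(toy$track_album_release_date, 1, 4))

# 1 minute = 60 seconds = 60 * 1000 milliseconds
toy$duration_mnt <- toy$duration_ms / (1000 * 60)

print(toy)
```

For instance, 194754 ms divided by 60000 gives 3.2459 minutes, which is the scale reported in the summaries that follow.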

A few of the columns, like ‘track_id’, ‘track_album_id’ and ‘playlist_id’, are not needed for the analysis because they contain only long alphanumeric identifiers. Let’s get rid of these columns.

#removing unnecessary columns
spotify <- spotify%>%dplyr::select(-track_id,-track_album_id,-playlist_id)

Now that we have dealt with the choice of variables, let us check the summary of our numerical variables, which are the ones we will mostly need for further analysis. We need to check for any abnormal or outlier values that could adversely affect our analysis.

#checking summary of numerical variables
spotify_num <- spotify %>% keep(is.numeric)
summary(spotify_num)

##  track_popularity  danceability        energy              key        
##  Min.   :  0.00   Min.   :0.0000   Min.   :0.000175   Min.   : 0.000  
##  1st Qu.: 24.00   1st Qu.:0.5630   1st Qu.:0.581000   1st Qu.: 2.000  
##  Median : 45.00   Median :0.6720   Median :0.721000   Median : 6.000  
##  Mean   : 42.48   Mean   :0.6549   Mean   :0.698603   Mean   : 5.374  
##  3rd Qu.: 62.00   3rd Qu.:0.7610   3rd Qu.:0.840000   3rd Qu.: 9.000  
##  Max.   :100.00   Max.   :0.9830   Max.   :1.000000   Max.   :11.000  
##     loudness            mode         speechiness      acousticness   
##  Min.   :-46.448   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: -8.171   1st Qu.:0.0000   1st Qu.:0.0410   1st Qu.:0.0151  
##  Median : -6.166   Median :1.0000   Median :0.0625   Median :0.0804  
##  Mean   : -6.720   Mean   :0.5657   Mean   :0.1071   Mean   :0.1754  
##  3rd Qu.: -4.645   3rd Qu.:1.0000   3rd Qu.:0.1320   3rd Qu.:0.2550  
##  Max.   :  1.275   Max.   :1.0000   Max.   :0.9180   Max.   :0.9940  
##  instrumentalness       liveness         valence           tempo       
##  Min.   :0.0000000   Min.   :0.0000   Min.   :0.0000   Min.   :  0.00  
##  1st Qu.:0.0000000   1st Qu.:0.0927   1st Qu.:0.3310   1st Qu.: 99.96  
##  Median :0.0000161   Median :0.1270   Median :0.5120   Median :121.98  
##  Mean   :0.0847599   Mean   :0.1902   Mean   :0.5106   Mean   :120.88  
##  3rd Qu.:0.0048300   3rd Qu.:0.2480   3rd Qu.:0.6930   3rd Qu.:133.92  
##  Max.   :0.9940000   Max.   :0.9960   Max.   :0.9910   Max.   :239.44  
##   duration_ms      release_year   duration_mnt    
##  Min.   :  4000   Min.   :1957   Min.   :0.06667  
##  1st Qu.:187805   1st Qu.:2008   1st Qu.:3.13008  
##  Median :216000   Median :2016   Median :3.60000  
##  Mean   :225797   Mean   :2011   Mean   :3.76328  
##  3rd Qu.:253581   3rd Qu.:2019   3rd Qu.:4.22635  
##  Max.   :517810   Max.   :2020   Max.   :8.63017

We can observe some outlier values in some of the variables from the summary of the dataset. For a better understanding, we will plot the boxplots of these concerned variables.

#plotting boxplots of numeric variables 

par(mfrow = c(2,5), oma = c(1,1,0,0) + 0.1, mar = c(3,3,1,1) + 0.1)

#variables to plot, in the order of the panels
box_vars <- c("danceability", "energy", "key", "loudness", "speechiness",
              "acousticness", "instrumentalness", "liveness", "valence", "tempo")

#one boxplot per variable, labelled below the panel
for (v in box_vars) {
  boxplot(spotify[[v]], col = "turquoise", pch = 19)
  mtext(v, cex = 0.8, side = 1, line = 2)
}

Observing the boxplots, we can find that:

There are some outlier values in “loudness” which are very low compared to the other values. We will take a closer look at this variable.
The variable “instrumentalness” also shows an abnormal distribution. We will take a closer look at its distribution too.
There is an outlier value for “tempo” at 0, which would mean that the overall estimated tempo of a track is 0 beats per minute. This does not make sense, so the minimum value of zero appears to be an outlier and we will remove it.
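The points drawn beyond the whiskers of a base R `boxplot()` come from `boxplot.stats()`, which by default flags values more than 1.5 times the hinge spread away from the hinges. The sketch below applies it to a handful of made-up tempo values (hypothetical, chosen only to include a zero) to show how such a value gets flagged.

```r
# Hypothetical tempo values in BPM, including one zero
tempo <- c(0, 95, 100, 110, 120, 122, 125, 130, 140, 150)

# boxplot.stats() returns the five plotted statistics and the flagged points
stats <- boxplot.stats(tempo)

flagged  <- stats$out             # values drawn as individual points
whiskers <- stats$stats[c(1, 5)]  # ends of the lower and upper whiskers

print(flagged)
print(whiskers)
```

Whether a flagged point is a true error or a genuine characteristic of the data still has to be judged variable by variable, which is what the next subsections do.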

Distribution of Loudness variable

To investigate the outlier values in this variable, let us look at its distribution across the different genres.

#boxplot of loudness across various genres
spotify %>% 
    ggplot( aes(x=loudness, y=playlist_genre, fill=loudness)) + 
  geom_boxplot(fill = "#4271AE") +
  xlab("loudness") +
  theme(legend.position="none")+
  ggtitle("Loudness Distribution across Genres")

As we can see from the boxplot, the ‘latin’ genre has relatively more outliers on the left side than the other genres. Therefore, this minimum value may well be genuine, as some latin tracks appear to have very low loudness. Besides, the data dictionary says loudness values typically range between -60 dB and 0 dB, so the minimum value of -46.448 is acceptable. Hence, we are not removing these outliers.

Distribution of Instrumentalness

The mean and the median values of the variable differ to a large extent. We will plot a histogram of the variable to understand more about this abnormality in the distribution.

#histogram of instrumentalness
spotify %>% 
  ggplot(aes(x=instrumentalness))+
  geom_histogram(binwidth = 0.1, fill = "turquoise")+
  ggtitle("Histogram of Instrumentalness")

We can see that most of the observations have a value close to 0, and this is what causes the large difference between the mean and the median. But we cannot conclude that these values are outliers, as this can be a genuine characteristic of the tracks. Also, since the number of observations with such values is quite high, we are not amending any of them, as doing so could affect our model.

Removing outliers from Tempo

As the minimum value of 0 for “tempo” does not make sense, it looks like an outlier. We will remove it.

#removing the observation(s) with a tempo of 0
spotify <- spotify[spotify$tempo > 0,]
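When dropping rows by a condition, a logical filter tends to be safer than negative indexing through `which()`. The sketch below, on a made-up three-row data frame, shows why: if no row matches the condition, `which()` returns an empty vector and negative indexing with it drops every row rather than none.

```r
# Toy data (hypothetical values)
toy <- data.frame(track = c("a", "b", "c"), tempo = c(120, 0, 98))

# Logical filter: keeps the rows with a positive tempo
kept <- toy[toy$tempo > 0, ]

# Pitfall: no tempo is negative, so which() returns integer(0),
# and -integer(0) selects zero rows instead of all of them
no_match <- which(toy$tempo < 0)
emptied  <- toy[-no_match, ]

# The logical form behaves as expected when nothing matches
unchanged <- toy[!(toy$tempo < 0), ]

print(nrow(kept))
print(nrow(emptied))
print(nrow(unchanged))
```

In this analysis a zero tempo does exist, so both forms remove the same observation; the difference only matters when the condition matches nothing.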

Cleaned Data Set

Our dataset has been cleaned and is now ready for Exploratory Data Analysis and further prediction modelling. We will have a look at the dimensions of our cleaned data.

#checking dimensions of dataset
print(paste("There are ",dim(spotify)[1],"observations and",dim(spotify)[2],"columns in our cleaned dataset"))

## [1] "There are  32827 observations and 22 columns in our cleaned dataset"

Let us see what our dataset looks like now.

#showing dataset
datatable(head(spotify,100),
          class = 'row-border stripe hover compact', 
          rownames = F, 
          autoHideNavigation = T, escape =FALSE)

The distribution of our dataset across the various genres is as follows:

#displaying frequencies across genres
kable(spotify %>% 
  group_by(playlist_genre) %>% 
  summarise(total = n())) %>% 
  kable_material(c("striped", "hover"))

playlist_genre	total
edm	6043
latin	5153
pop	5507
r&b	5431
rap	5743
rock	4950

Variables of Concern

We have 22 variables in our dataset, of which 15 are numerical. We will be using these numerical variables for our further analysis. Let us have a final look at the summary of the concerned variables (mode, duration_ms and release_year are excluded, as their summaries carry little significance here):

Statistic	track_popularity	danceability	energy	key	loudness	speechiness	acousticness	instrumentalness	liveness	valence	tempo	duration_mnt
Min	0.00	0.0000	0.0002	0.000	-46.448	0.0000	0.0000	0.0000	0.0000	0.0000	0.00	0.0667
Max	100.00	0.9830	1.0000	11.000	1.275	0.9180	0.9940	0.9940	0.9960	0.9910	239.44	8.6302
Median	45.00	0.6720	0.7210	6.000	-6.166	0.0625	0.0804	0.0000	0.1270	0.5120	121.98	3.6000
Mean	42.48	0.6549	0.6986	5.374	-6.720	0.1071	0.1754	0.0848	0.1902	0.5106	120.88	3.7633

EXPLORATORY DATA ANALYSIS

Popularity Analysis

“This is my favorite part about analytics: Taking boring flat data and bringing it to life through visualization” - John Tukey

Now, as we have a clean and well defined dataset, let us perform various visualizations to gain more insights about the data.

Let us start with the popularity analysis. For the purpose of this study, I am classifying the track popularity attribute into three classes: low, medium and high popularity. As the dictionary mentions, track popularity is a value between 0 and 100. I am classifying the groups as follows:

high - track popularity greater than 75
medium - track popularity greater than 30 and up to 75
low - track popularity of 30 or less

spotify <- spotify %>% 
  mutate(popularity = case_when(track_popularity <= 30 ~ "low",
                                track_popularity > 30 & track_popularity <= 75  ~ "medium",
                                track_popularity > 75 ~ "high"))
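The same thresholds can also be written with base R's `cut()`, which makes the boundary handling explicit. This is a minimal sketch on a few hand-picked boundary values rather than the real column; with `right = TRUE` each interval includes its upper bound, so 30 falls in "low" and 75 in "medium".

```r
# Boundary values chosen to exercise each threshold
track_popularity <- c(0, 30, 31, 75, 76, 100)

# Intervals: (-Inf, 30], (30, 75], (75, Inf)
popularity <- cut(track_popularity,
                  breaks = c(-Inf, 30, 75, Inf),
                  labels = c("low", "medium", "high"),
                  right  = TRUE)

print(data.frame(track_popularity, popularity))
```

Unlike `case_when()`, `cut()` returns an ordered set of factor levels (low, medium, high), which can be convenient for plotting the classes in a sensible order.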

Popularity can be defined with respect to tracks and artists. We can now see which are the popular tracks as well as popular artists.

Top tracks

Let us find out the top tracks in the dataset:

popular_track <- spotify %>%
  filter(popularity == "high") %>%
  arrange(desc(track_popularity)) %>% 
  distinct(track_name, track_popularity)

datatable(
  head(popular_track,10),
  extensions = 'FixedColumns',
  options = list(
    scrollY = "400px",
    scrollX = TRUE,
    fixedColumns = TRUE
  )
)

As we can see, “Dance Monkey” is the most popular song in our dataset. It is also the only song with a popularity of 100. Try checking your favourite song in the list.

Artists with the most popular songs

I am also interested to know which artist has the most popular songs to their name. In particular, I am excited to see whether ‘Drake’, who is my current favourite, appears in the top artists list.

top_artist <-
  spotify %>%
  dplyr::select(track_artist,track_popularity,popularity) %>%
  filter(popularity == "high") %>%
  arrange(desc(track_popularity)) %>%
  count(track_artist) %>%
  arrange(-n) %>%
  head(10)

 top_artist %>% 
  ggplot(aes(reorder(track_artist, n), n)) + 
  geom_col(fill = "#f68060") + 
  coord_flip() +
  labs(x = 'Artist', y = 'No: of songs', title = 'Top 10 Popular Artists') +
  theme(plot.title = element_text(hjust = 0.5),legend.position = 'bottom') +
  geom_text(aes(label = n), nudge_y = 1)

Naah! He is not in the top 10 list. ‘Ed Sheeran’ is the artist with the highest number of popular songs (39) to his name. Who are your favourite artists? Do they feature in the list?

Top artist by genre

The top artists list features many edm artists. This may be due to the high popularity of edm songs. So, what about the artists who create songs in other genres? We will try to find out who the top artists in each genre are. We can use a tree map to analyze this.

artist_genre <- spotify %>% dplyr::select(playlist_genre,track_artist,track_popularity) %>% group_by(playlist_genre,track_artist) %>% summarise(n = n()) %>% top_n(10, n)

tm <- treemap(artist_genre, index = c("playlist_genre", "track_artist"), vSize = "n", vColor = 'playlist_genre', palette =  viridis(6),title="Top 10 Track Artists within each Playlist Genre")

Above, the treemap depicts the top 10 track artists within each playlist genre. The size of each box corresponds to the number of tracks by that artist.
For the genres edm, rock, pop, rap, latin and r&b, the top track artists are Martin Garrix, Queen, The Chainsmokers, Logic, Don Omar and Bobby Brown respectively.

I am happy to find Drake’s name in the top list of both r&b and rap.

Most common words in popular song title

I thought it would be interesting to look at a slightly more unexplored place: the title of the track. Just by looking at the title alone, could we pick up some traction on why songs succeed and fail? I am not sure. But I would like to see the common words used in the titles of popular songs. For this analysis, I am considering the tracks with high and medium popularity and creating a wordcloud.

# Create a vector containing only the text
songs_popular <-   spotify %>%
  dplyr::select(track_name,popularity) %>%
   filter(popularity == "high"|popularity == "medium")
 
 text <- songs_popular$track_name 

 
# Create a corpus  
docs <- Corpus(VectorSource(text))

#clean text data
 docs <- docs %>%
   tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
   tm_map(stripWhitespace)
 docs <- tm_map(docs, content_transformer(tolower))
 docs <- tm_map(docs, removeWords, stopwords("english"))
 docs <- tm_map(docs, removeWords,c("feat","edit","remix","remastered","remaster","radio","version","original","mix","edm","rock","latin","pop","rap","r&b","music","tãº"))
 
#create a term-document matrix
 
 dtm <- TermDocumentMatrix(docs) 
 matrix <- as.matrix(dtm) 
 words <- sort(rowSums(matrix),decreasing=TRUE) 
 df <- data.frame(word = names(words),freq=words)
 
#generate the word cloud
 
 set.seed(101)
 wordcloud(words = df$word, freq = df$freq, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35, 
           colors=brewer.pal(8, "Dark2"))

We can see that ‘Love’ is the most frequently appearing word in the titles of popular songs. We can also notice some other common words such as ‘Like’, ‘Dont’, ‘One’ etc.
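The counts behind a word cloud are simply word frequencies after cleaning. As a rough cross-check, the sketch below reproduces the idea in base R on four made-up titles with a short, hypothetical stop-word list; it is not the `tm` pipeline itself, only the same lowercase, strip-punctuation, split and count steps.

```r
# Made-up titles and a short stop-word list (both hypothetical)
titles     <- c("Love Me Like You Do", "One Love",
                "Don't Start Now", "Love Story - Remix")
stop_words <- c("me", "you", "do", "now", "remix")

clean  <- tolower(titles)
clean  <- gsub("[[:punct:]]", "", clean)           # "don't" becomes "dont"
tokens <- unlist(strsplit(clean, "[[:space:]]+"))  # split on whitespace

# Drop empty strings and stop words, then count what is left
tokens <- tokens[nzchar(tokens) & !(tokens %in% stop_words)]
freq   <- sort(table(tokens), decreasing = TRUE)

print(freq)
```

Stripping punctuation before counting is also why a word such as ‘Dont’ shows up in the cloud without its apostrophe.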

Correlation between attributes

Though the structure of each song is in some way unique, there are definitely some common threads happening. Let us check for the correlation between various attributes of a song.

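#selecting the audio attributes by name rather than by column position
song_attributes <- spotify %>%
  dplyr::select(danceability, energy, key, loudness, speechiness, acousticness,
                instrumentalness, liveness, valence, tempo, duration_mnt)
att_cor <- song_attributes %>% cor() %>% 
          ggcorrplot(type = "lower", hc.order = TRUE, colors = brewer.pal(n = 3, name = "RdYlBu"))
att_cor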

From the correlation plot, we can observe that:

There exists a high positive correlation between energy and loudness.
There exists a high negative correlation between energy and acousticness.
There is a moderate correlation between loudness and acousticness, and between valence and danceability.
We can also observe that speechiness, tempo and key have no strong correlation with track popularity. This suggests that popularity is more likely to be associated with the following characteristics:
- acousticness
- loudness
- valence
- danceability
- liveness
- energy
- instrumentalness

This study can be helpful to us when we try to build a predictive model.
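To look at the relationship with popularity directly, one option is to pull out just the `track_popularity` column of the correlation matrix and rank the attributes by absolute correlation. The sketch below does this on a made-up five-row data frame (all values hypothetical); on the real data the same lines would be applied to the numeric columns of `spotify`.

```r
# Toy data (hypothetical values), one row per track
toy <- data.frame(
  track_popularity = c(20, 35, 50, 65, 80),
  loudness         = c(-12, -9, -7, -6, -4),
  acousticness     = c(0.80, 0.55, 0.40, 0.20, 0.10),
  key              = c(5, 0, 7, 2, 5)
)

# Pearson correlation of every attribute with track_popularity
cors <- cor(toy)[, "track_popularity"]
cors <- cors[names(cors) != "track_popularity"]

# Rank attributes by the strength of the correlation, ignoring sign
ranked <- cors[order(abs(cors), decreasing = TRUE)]

print(round(ranked, 3))
```

A correlation only describes a linear association, so a ranking like this is a screening step for choosing model inputs rather than evidence that an attribute drives popularity.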

Music Trend over the Decades

Music trends change every day, and it got me thinking: what genres of music define each decade? What are the changes in the characteristics of music during these periods?

To find out, I am classifying the release years of songs into the corresponding decades (named release_era in the dataset) and trying to visualize them.

spotify$release_era <- ifelse(spotify$release_year < 1970 , "1960's",
                              ifelse( spotify$release_year < 1980, "1970's",
                                      ifelse( spotify$release_year < 1990, "1980's",
                                              ifelse( spotify$release_year < 2000,"1990's",
                                                      ifelse( spotify$release_year < 2010, "2000's", "2010's")))))
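The same decade labels can be produced with a single `cut()` call, which may be easier to read and extend than nested `ifelse()`. This is a sketch on a few hand-picked years; it mirrors the logic above, so years before 1970 are labelled 1960's and 2020 is grouped with the 2010's.

```r
# Years chosen to sit on or next to each boundary
release_year <- c(1957, 1969, 1970, 1999, 2009, 2010, 2020)

# right = FALSE makes each interval include its lower bound, e.g. [1970, 1980)
release_era <- cut(release_year,
                   breaks = c(-Inf, 1970, 1980, 1990, 2000, 2010, Inf),
                   labels = c("1960's", "1970's", "1980's",
                              "1990's", "2000's", "2010's"),
                   right  = FALSE)

print(data.frame(release_year, release_era))
```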

Now, let us see which genres were the most popular in each decade.

trend <-  spotify %>% select (release_era , playlist_genre,track_popularity) %>% 
  group_by (release_era ,playlist_genre) %>% 
  summarise(rating = mean(track_popularity))
trend_plot <- trend %>%
  plot_ly(
    type = 'bar', 
    x = trend$playlist_genre, 
    y = trend$rating,
    hoverinfo = 'text',
    mode = 'markers', 
    transforms = list(
      list(
        type = 'filter',
        target = ~release_era,
        groups = trend$playlist_genre,
        operation = '=',
        value = unique(spotify$release_era)[1]
      )
    )) %>% layout(
      updatemenus = list(
        list(
          type = 'dropdown',
          active = 0,
          buttons = list(
            list(method = "restyle",
                 args = list("transforms[0].value", unique(spotify$release_era)[1]),
                 label = unique(spotify$release_era)[1]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(spotify$release_era)[2]),
                 label = unique(spotify$release_era)[2]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(spotify$release_era)[3]),
                 label = unique(spotify$release_era)[3]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(spotify$release_era)[4]),
                 label = unique(spotify$release_era)[4]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(spotify$release_era)[5]),
                 label = unique(spotify$release_era)[5]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(spotify$release_era)[6]),
                 label = unique(spotify$release_era)[6])
          )
        )
      )
    )
trend_plot

We can observe that latin and pop are the most popular genres of the current decade.
We can also observe that pop has been a popular genre throughout the decades. You can use the dropdown in the above plot to check the trend of each genre across the decades.
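The interactive plot shows one decade at a time, so a small table of the top genre per decade can be a handy companion. The sketch below computes mean popularity by decade and genre with base R's `aggregate()` and then keeps the highest-rated genre in each decade. The `toy` rows are hypothetical and only illustrate the mechanics.

```r
# Toy data (hypothetical values)
toy <- data.frame(
  release_era      = c("2000's", "2000's", "2010's", "2010's", "2010's", "2010's"),
  playlist_genre   = c("pop", "rock", "pop", "pop", "latin", "rock"),
  track_popularity = c(40, 50, 60, 70, 68, 30),
  stringsAsFactors = FALSE
)

# Mean popularity for every decade and genre combination
trend <- aggregate(track_popularity ~ release_era + playlist_genre,
                   data = toy, FUN = mean)
names(trend)[3] <- "rating"

# Keep the genre with the highest mean rating within each decade
top <- do.call(rbind, lapply(split(trend, trend$release_era),
                             function(d) d[which.max(d$rating), ]))

print(top)
```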

Music in the 21st Century

If the music biz could strike a pose for the ‘10-year challenge’ (the social-media craze comparing selfies from the start and end of the decade), then its glossy 2009 shot would surely be upstaged by a more impulsive, increasingly worldly 2019 vision. The 2010s have actually been quite a transformative decade. Not only have the faces of music changed, but the hierarchy of music genres has also been rearranged. Let us try to find the change in various song attributes during this period.

# Yearly averages of selected attributes for songs released after 2010
trend <- spotify %>%
  group_by(release_year) %>%
  filter(release_year > 2010) %>%
  summarise(popularity_avg = mean(track_popularity),
            danceability_avg = mean(danceability),
            energy_avg = mean(energy),
            loudness_avg = mean(loudness),
            duration_avg = mean(duration_mnt),
            speechiness_avg = mean(speechiness))
t1 <- ggplot(trend,aes(x = release_year,y = popularity_avg))+
  geom_line(color = "#00AFBB", size = 1)+
  scale_x_continuous(breaks=seq(2011, 2020, 1))
t2 <- ggplot(trend,aes(x = release_year,y = danceability_avg))+
  geom_line(color = "#00AFBB", size = 1)+
  scale_x_continuous(breaks=seq(2011, 2020, 1))
t3 <- ggplot(trend,aes(x = release_year,y = energy_avg))+
  geom_line(color = "#00AFBB", size = 1)+
  scale_x_continuous(breaks=seq(2011, 2020, 1))
t4 <- ggplot(trend,aes(x = release_year,y = loudness_avg))+
  geom_line(color = "#00AFBB", size = 1)+
  scale_x_continuous(breaks=seq(2011, 2020, 1))
t5 <- ggplot(trend,aes(x = release_year,y = duration_avg))+
  geom_line(color = "#00AFBB", size = 1)+
  scale_x_continuous(breaks=seq(2011, 2020, 1))
t6 <- ggplot(trend,aes(x = release_year,y = speechiness_avg))+
  geom_line(color = "#00AFBB", size = 1)+
  scale_x_continuous(breaks=seq(2011, 2020, 1))
grid.arrange(t1,t2,t3,t4,t5,t6,ncol = 2)

From the above plots, here are the findings about music affinity in the 21st century:

In the last 5 years, the average popularity of songs has risen sharply, likely helped by the spread of the internet and streaming services. 2014 looks like a weak year for artists, with the lowest average popularity of the period.
Danceability has also increased noticeably in the last 3 years. Listeners seem to enjoy danceable songs, possibly thanks to TikTok and other online talent-showcase platforms.
Energy and loudness share almost the same pattern, which agrees with the correlation between them seen in the correlation plot.
Another trend is that the duration of songs shows a marked decline, which suggests that people now prefer shorter songs than they did in the early 2010s.
The speechiness attribute also increases over the years, which is consistent with our understanding that rap songs and podcasts have become more popular over the last few years.
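The six panels above repeat the same ggplot call with a different y variable. One alternative is to reshape the yearly averages to long format and let `facet_wrap()` draw one panel per attribute. This is only a sketch: `trend_toy` is a small invented stand-in for the `trend` data frame, and its values are not real results.

```r
library(tidyr)
library(ggplot2)

# Toy yearly averages standing in for `trend`; the numbers are invented.
trend_toy <- data.frame(
  release_year     = 2011:2013,
  popularity_avg   = c(40, 38, 42),
  danceability_avg = c(0.62, 0.63, 0.65)
)

# Reshape to long format: one row per (year, attribute) pair.
trend_long <- pivot_longer(trend_toy,
                           cols      = -release_year,
                           names_to  = "attribute",
                           values_to = "average")

# One panel per attribute; y-scales are free because the attributes
# are measured in different units.
p <- ggplot(trend_long, aes(x = release_year, y = average)) +
  geom_line(color = "#00AFBB") +
  scale_x_continuous(breaks = 2011:2013) +
  facet_wrap(~ attribute, scales = "free_y", ncol = 2)
# Call print(p) to draw the figure.
```

With the real `trend` data frame, the same two steps would replace the six separate plots and the `grid.arrange()` call.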

SONG RECOMMENDATION ENGINE

One of Spotify’s most popular features is its Discover Weekly playlist, which is generated each week based on a user’s listening habits. As a Spotify user, I have found these playlists to be extremely accurate and useful. I wanted to try building a basic version of it: a song recommendation engine based on the following attributes:

Based on Genre: Songs are displayed according to the user’s preferred genre and rating range.
Based on Artists: Songs are filtered according to the user’s preferred artist and rating range.
Based on Mood: Songs are filtered according to the mood and rating range specified by the user. For this purpose, songs have been classified into groups: Gym (songs with high energy), Cheerful (songs with high valence), Party/Dance (songs with high danceability) and Others.
Based on Era: Songs are filtered according to the release era the user likes to listen to.
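Because each song receives exactly one mood, the order in which the rules are checked matters. The sketch below illustrates this with invented songs and a fixed cutoff of 0.5 (the app itself uses the dataset medians as thresholds); `classify_mood` is a hypothetical helper written for this illustration.

```r
# Toy songs; attribute values are invented for illustration.
songs <- data.frame(
  track_name   = c("A", "B", "C", "D"),
  danceability = c(0.9, 0.3, 0.2, 0.1),
  energy       = c(0.9, 0.8, 0.2, 0.3),
  valence      = c(0.9, 0.9, 0.7, 0.1)
)

# Hypothetical helper: rules are checked in order, so the first match wins.
classify_mood <- function(danceability, energy, valence, cutoff = 0.5) {
  ifelse(danceability >= cutoff, "Party/Dance",
         ifelse(energy >= cutoff, "Gym",
                ifelse(valence >= cutoff, "Cheerful", "Others")))
}

songs$mood <- classify_mood(songs$danceability, songs$energy, songs$valence)
songs[, c("track_name", "mood")]
```

Song A is high on all three attributes but is labelled only Party/Dance, and song B is both energetic and cheerful but is labelled Gym. A song can therefore be missing from a mood tab even though it meets that mood's criterion.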

Application

Code

Here is the code snippet for the R Shiny app (Song Recommendation Engine):

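# Assign one mood per song using the dataset medians as thresholds.
# The checks run in order: danceability first, then energy, then valence.
spotify$mood <- ifelse(spotify$danceability >= median(spotify$danceability),"Party/Dance",
                       ifelse(spotify$energy >= median(spotify$energy),"Gym",
                             ifelse(spotify$valence >= median(spotify$valence),"Cheerful","Others")))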
                                     

shinyUI(navbarPage(theme = shinytheme("superhero"),"Song recommender",
                   tabPanel("Based on Genre",
                            sidebarPanel(
                              # Genre Selection
                              
                              selectInput(inputId = "Columns", label = "Which genres do you like?",
                                          unique(spotify$playlist_genre), multiple = FALSE),
                              verbatimTextOutput("rock"),
                              
                              sliderInput(inputId = "range", label = "Range of Ratings that you wish to listen?",
                                          min = min(spotify$track_popularity),max = 100,value = c(50,100))
                            ),
                            mainPanel(
                              h2("Top songs of the genre"),
                              DT::dataTableOutput(outputId = "songsreco")
                            )
                   ),
                   tabPanel("Based on Artist",
                            sidebarPanel(selectInput(inputId = "singers", label = "Which singer do you like?",
                                                     unique(spotify$track_artist), multiple = FALSE),
                                         verbatimTextOutput("Ed Sheeran"),
                                         
                                         sliderInput(inputId = "range_2", label = "Range of Ratings that you wish to listen?",
                                                     min = min(spotify$track_popularity),max = 100,value = c(50,100))),
                            mainPanel(
                              h2("Top songs of the artist"),
                              DT::dataTableOutput(outputId = "songsreco_artist"))),
                   
                   tabPanel("Based on Mood",
                            sidebarPanel(selectInput(inputId = "Mood", label = "Which mood songs do you like to listen?",
                                                     unique(spotify$mood), multiple = FALSE),
                                         verbatimTextOutput("Party/Dance"),
                                         
                                         sliderInput(inputId = "range_4", label = "Range of Ratings that you wish to listen?",
                                                     min = min(spotify$track_popularity),max = 100,value = c(50,100))),
                            mainPanel(
                              h2("Top songs of the mood"),
                              DT::dataTableOutput(outputId = "songsreco_mood"))),
                   
                   tabPanel("Based on Era",
                            sidebarPanel(
                              # Era Selection
                              
                              selectInput(inputId = "Era", label = "Which era song do you like to listen?",
                                          unique(spotify$release_era), multiple = FALSE),
                              verbatimTextOutput("2010's"),
                              
                              sliderInput(inputId = "range_3", label = "Range of Ratings that you wish to listen?",
                                          min = min(spotify$track_popularity),max = 100,value = c(50,100))
                            ),
                            mainPanel(
                              h2("Top songs of the Era"),
                              DT::dataTableOutput(outputId = "songsreco_era")
                            )
                   )
                   
                   ))

  
  
  shinyServer(function(input, output) {
  
  datasetInput <- reactive({
    
    # Filtering the songs based on genre and rating
    spotify %>% filter(playlist_genre %in% as.vector(input$Columns)) %>%
      group_by(track_name) %>% filter(track_popularity >= as.numeric(input$range[1]), track_popularity <= as.numeric(input$range[2])) %>%
      arrange(desc(track_popularity)) %>%
      select(track_name, track_artist, track_popularity, playlist_genre) %>%
      rename(`song` = track_name, `Genre(s)` = playlist_genre)
    
    
  })
  
  datasetInput2 <- reactive({
    
    # Filtering the songs based on artist and rating
    spotify %>% filter(track_artist %in% as.vector(input$singers)) %>%
      group_by(track_name) %>% filter(track_popularity >= as.numeric(input$range_2[1]), track_popularity <= as.numeric(input$range_2[2])) %>%
      arrange(desc(track_popularity)) %>%
      select(track_name, track_artist, track_popularity, playlist_genre) %>%
      rename(`song` = track_name, `Genre(s)` = playlist_genre)
    
    
  })
  
  
  datasetInput3 <- reactive({
    
    # Filtering the songs based on era and rating
    spotify %>% filter(release_era %in% as.vector(input$Era)) %>%
      group_by(track_name) %>% filter(track_popularity >= as.numeric(input$range_3[1]), track_popularity <= as.numeric(input$range_3[2])) %>%
      arrange(desc(track_popularity)) %>%
      select(track_name, track_artist, track_popularity, playlist_genre) %>%
      rename(`song` = track_name, `Genre(s)` = playlist_genre)
     
    
  })
  
  datasetInput4 <- reactive({
    
    # Filtering the songs based on mood and rating
    spotify %>% filter(mood %in% as.vector(input$Mood)) %>%
      group_by(track_name) %>% filter(track_popularity >= as.numeric(input$range_4[1]), track_popularity <= as.numeric(input$range_4[2])) %>%
      arrange(desc(track_popularity)) %>%
      select(track_name, track_artist, track_popularity, playlist_genre) %>%
      rename(`song` = track_name, `Genre(s)` = playlist_genre)
    
    
  })
 
  
  #Rendering the table
  output$songsreco <- DT::renderDataTable({
    
    DT::datatable(head(datasetInput(), n = 50), escape = FALSE, options = list(scrollX = '1000px'))
  })
  
  output$songsreco_artist <- DT::renderDataTable({
    
    DT::datatable(head(datasetInput2(), n = 100), escape = FALSE, options = list(scrollX = '1000px'))
  })
  
  output$songsreco_era <- DT::renderDataTable({
    
    DT::datatable(head(datasetInput3(), n = 50), escape = FALSE, options = list(scrollX = '1000px'))
  })
  
  output$songsreco_mood <- DT::renderDataTable({
    
    DT::datatable(head(datasetInput4(), n = 50), escape = FALSE, options = list(scrollX = '1000px'))
  })
})
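The four reactive blocks in the server code apply the same pipeline and differ only in the column and inputs they use. A possible refactoring is to move the shared logic into one function. The sketch below uses base R and invented toy rows; `filter_songs` is a hypothetical helper, and it leaves out the column renaming done in the app.

```r
# Hypothetical helper: keep rows where `column` matches `value` and the
# popularity lies in [range[1], range[2]], sorted from most to least popular.
filter_songs <- function(data, column, value, range) {
  keep <- data[[column]] %in% value &
    data$track_popularity >= range[1] &
    data$track_popularity <= range[2]
  out <- data[keep, c("track_name", "track_artist",
                      "track_popularity", "playlist_genre")]
  out[order(-out$track_popularity), ]
}

# Toy data with the same column names as the app; the rows are invented.
toy <- data.frame(
  track_name       = c("s1", "s2", "s3"),
  track_artist     = c("a1", "a2", "a1"),
  track_popularity = c(80, 40, 95),
  playlist_genre   = c("pop", "pop", "rock"),
  stringsAsFactors = FALSE
)

by_genre  <- filter_songs(toy, "playlist_genre", "pop", c(50, 100))  # keeps s1
by_artist <- filter_songs(toy, "track_artist", "a1", c(50, 100))     # s3, then s1
by_artist
```

Inside the server function, each reactive could then shrink to a single call such as `reactive(filter_songs(spotify, "playlist_genre", input$Columns, input$range))`.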

MODELLING

In this section, I try to build a model that predicts the popularity of a song from its other attributes. More particularly, the model predicts which popularity class (low, medium or high) a song falls into based on its other attributes.

Logistic Regression with multinomial variables

Since our response variable has three popularity classes, we can use multinomial logistic regression. We saw from the correlation plot in the exploratory data analysis that track popularity is correlated with acousticness, loudness, valence, danceability, liveness, energy and instrumentalness, so it is a good idea to fit the popularity class against song attributes of this kind. The model below uses danceability, energy, loudness, acousticness, instrumentalness and duration_mnt as predictors. The first step is to randomly split the dataset into a training set (70%) and a testing set (30%) for model validation. I train the model on the training set and then test its predictive capability on the testing set.

# Keep the six predictor columns and the popularity class (selected by position)
spotify_train <- spotify[c(9:10,12,15:16,22:23)]
set.seed(123)
train_idx <- sample(nrow(spotify_train), .70*nrow(spotify_train))

train <- spotify_train[train_idx,]
test <- spotify_train[-train_idx,]

Now let us perform the model fitting and analysis. When we build a multinomial logistic model, we need to set one of the levels of the dependent variable as the baseline. We achieve this with the relevel() function.

# Setting the baseline 
train$popularity <- relevel(factor(train$popularity), ref = "low")
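As a small illustration of what relevel() changes, the sketch below uses a toy factor with the three class labels from this report. Only the order of the levels changes; the data values stay the same.

```r
# Toy response with the three class labels used in the report.
popularity <- factor(c("high", "low", "medium", "low"))
levels(popularity)   # alphabetical by default: "high" "low" "medium"

# relevel() moves the reference level to the front and keeps the rest in order.
popularity <- relevel(popularity, ref = "low")
levels(popularity)   # "low" "high" "medium"
```

The first level is the one the other classes are compared against, which is why the coefficient tables further down have rows for high and medium only.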

Once the baseline has been specified, we use the multinom() function to fit the model and then the summary() function to explore its beta coefficients.

# Training the multinomial model (multinom() comes from the nnet package;
# the "- 1" in the formula drops the intercept)
multinom.fit <- multinom(popularity ~ . - 1, data = train)

## # weights:  21 (12 variable)
## initial  value 25243.913169 
## iter  10 value 19314.466086
## iter  20 value 19187.485171
## iter  20 value 19187.485166
## iter  20 value 19187.485166
## final  value 19187.485166 
## converged

# Checking the model
summary(multinom.fit)

## Call:
## multinom(formula = popularity ~ . - 1, data = train)
## 
## Coefficients:
##        danceability    energy    loudness acousticness instrumentalness
## high       2.906834 -2.224564  0.18568060    1.2284163       -3.5823068
## medium     1.048789  0.496183 -0.02053985    0.9927426       -0.6678725
##        duration_mnt
## high     -0.1773257
## medium   -0.1558312
## 
## Std. Errors:
##        danceability     energy    loudness acousticness instrumentalness
## high     0.16496379 0.14760476 0.011879543   0.13411631       0.32585038
## medium   0.08555869 0.07401473 0.005591354   0.07817619       0.06223128
##        duration_mnt
## high     0.02961251
## medium   0.01399996
## 
## Residual Deviance: 38374.97 
## AIC: 38398.97

The output of summary() contains a table of coefficients and a table of standard errors. Each row of the coefficient table corresponds to one model equation: the log-odds of the “high” or “medium” class relative to the baseline class “low”. The ratio of the probability of a given class to the probability of the baseline class is referred to as the relative risk (often described as the odds). The model reports the log of these odds, so to get the relative risk ratio, i.e. the odds ratio, we need to exponentiate the coefficients.

# extracting coefficients from the model and exponentiate
exp(coef(multinom.fit))

##        danceability    energy  loudness acousticness instrumentalness
## high      18.298773 0.1081146 1.2040376     3.415816       0.02781147
## medium     2.854193 1.6424402 0.9796697     2.698626       0.51279838
##        duration_mnt
## high      0.8375070
## medium    0.8557036

The output above shows the relative risk ratio for a one-unit increase in each variable for being in the high or medium popularity class versus the low popularity class. A value of 1 means no change, a value greater than 1 means an increase in the odds, and a value less than 1 means a decrease. We can also use probabilities to understand our model.

head(probability.table <- fitted(multinom.fit))

##             low       high    medium
## 2986  0.3086634 0.07597690 0.6153597
## 29931 0.4320298 0.01013535 0.5578348
## 29716 0.3413449 0.07568651 0.5829686
## 2757  0.3950219 0.02060377 0.5843743
## 9645  0.3331585 0.06574225 0.6010993
## 31319 0.2981284 0.08629185 0.6155798
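Probabilities of this kind can be reproduced by hand from the coefficient table. Because the model has no intercept, the log-odds of each class versus “low” are simply the coefficients multiplied by the attribute values, and the baseline class has log-odds 0. The sketch below copies the coefficients from the summary output and applies them to a hypothetical song; the song's attribute values are invented.

```r
# Coefficients copied from summary(multinom.fit): log-odds relative to "low".
beta <- matrix(
  c(2.906834, -2.224564,  0.18568060, 1.2284163, -3.5823068, -0.1773257,
    1.048789,  0.496183, -0.02053985, 0.9927426, -0.6678725, -0.1558312),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("high", "medium"),
                  c("danceability", "energy", "loudness",
                    "acousticness", "instrumentalness", "duration_mnt"))
)

# Hypothetical song; the attribute values are invented for illustration.
song <- c(danceability = 0.7, energy = 0.7, loudness = -6,
          acousticness = 0.1, instrumentalness = 0, duration_mnt = 3.5)

# Linear predictors: log-odds of "high" and "medium" versus "low".
eta <- drop(beta %*% song[colnames(beta)])

# The baseline has odds 1; dividing by the total turns odds into probabilities.
odds  <- c(low = 1, exp(eta))
probs <- odds / sum(odds)
round(probs, 3)

# Odds multiplier for "high" vs "low" when danceability rises by 0.1.
exp(0.1 * beta["high", "danceability"])   # about 1.34
```

The last line is a reminder that a "one-unit increase" in danceability spans its whole 0 to 1 range, so the large ratio of 18.3 in the exponentiated table corresponds to a much smaller multiplier for a realistic change of 0.1.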

The table above indicates that the probability of observation 2986 being in the medium popularity class is 61.54%, in the low class 30.87% and in the high class 7.60%. Thus the model classifies observation 2986 as medium popularity. Similarly, observations 29931 and 29716 are also classified as medium popularity, and so on. We will now check the model accuracy by building a classification table. Let us first build the classification table for the training dataset and calculate the model accuracy.

# Predicting the classes for the training dataset
train$predicted <- predict(multinom.fit, newdata = train, type = "class")

# Building classification table
ctable <- table(train$popularity, train$predicted)

# Calculating accuracy - sum of diagonal elements divided by total obs
round((sum(diag(ctable))/sum(ctable))*100,2)

## [1] 61.8

Accuracy in training dataset is 61.8%. We now repeat the above on the testing dataset.

# Predicting the classes for the test dataset
test$predicted <- predict(multinom.fit, newdata = test, type = "class")

# Building classification table
ctable <- table(test$popularity, test$predicted)

# Calculating accuracy - sum of diagonal elements divided by total obs
round((sum(diag(ctable))/sum(ctable))*100,2)

## [1] 60

The model predicts the popularity class with about 60% accuracy on the testing set.
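Two checks are worth making when reading an accuracy figure like this. First, sum(diag(...)) only counts correct predictions if the rows and columns of the table are in the same order. The training response was releveled to put “low” first, but the test response was not, so the orders may differ. Whether this affects the reported figure depends on how popularity is stored in the dataset, which this section does not show. Second, accuracy is easier to judge against the share of the most frequent class. The sketch below illustrates both points with invented labels.

```r
# Toy labels; invented for illustration, not taken from the dataset.
actual    <- c("low", "medium", "medium", "high", "medium", "low")
predicted <- factor(c("medium", "medium", "medium", "medium", "medium", "low"),
                    levels = c("low", "high", "medium"))

# Naive table: rows sort alphabetically (high, low, medium) while columns
# follow the factor levels (low, high, medium), so the diagonal is misaligned.
naive <- sum(diag(table(actual, predicted))) / length(actual)

# Give both vectors the same level order so the diagonal lines up.
actual_f <- factor(actual, levels = levels(predicted))
ctable   <- table(actual = actual_f, predicted = predicted)
accuracy <- sum(diag(ctable)) / sum(ctable)

# Baseline: always predict the most frequent actual class.
baseline <- max(table(actual_f)) / length(actual_f)

c(naive = naive, accuracy = accuracy, baseline = baseline)
```

In this toy case the misaligned table reports 0.5 while the aligned one reports 4/6, and the baseline of 0.5 shows how much of that accuracy comes from simply predicting the majority class.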

SUMMARY

Problem statement

The main objective of our study was to find out how the attributes of a song affect its popularity. We also decided to analyze the popularity of songs, the top artists, the top tracks and the general trend of music affinity over the decades.

Methodology used for Analysis

Initially, we performed a popularity analysis by finding the top tracks and top artists in each genre with the help of a tree map, and the artists with the largest number of popular songs with the help of a bar chart. We also examined the common words appearing in the titles of popular songs by generating a word cloud.
We then created a correlation plot to identify the relationships between various song attributes.
Next, we looked at the trend in music affinity over the decades. We did this with multiple line graphs, which provided some interesting findings.
Finally, we developed a model that predicts the popularity class (low, medium, high) of a song given its other characteristics. We used multinomial logistic regression to build the model.
We also developed a Song Recommendation Engine that recommends songs according to the user’s preferred genre, artist, mood and era.

Insights from the analysis

By performing the analysis described in our methodology, we came up with some interesting findings, including:

“Dance Monkey”, a song by Tones & I released in 2019, is the most popular song. In fact, it is the only song in the dataset with a popularity of 100.
“Ed Sheeran” is the artist with the largest number of popular songs (39). It is not much of a surprise to see an artist with 4 Grammy wins and 14 nominations topping the list.
Words like “Love”, “Like”, “Don’t” and “One” appear very commonly in the titles of popular songs.
We found a strong relationship between the energy and loudness attributes, which is consistent with common sense. We also found that track popularity is correlated with the attributes acousticness, loudness, valence, danceability, liveness, energy and instrumentalness.
From our study of music trends over the decades, we found that latin and pop are the most popular genres of the current decade. We also observed that pop has been a popular genre throughout the decades.
A more detailed analysis of 21st-century music revealed several interesting changes in music preference. More people now prefer short and danceable songs. Rap and podcasts, which have high speechiness, have also gained popularity in the last few years.
We were also able to develop a prediction model that classifies songs into low, medium or high popularity based on their characteristics. The model achieved about 60% accuracy.

Implications

We intended this project to help artists understand what their audience is looking for and improve the popularity of their tracks. It was also meant to help music distributors streamline their music libraries.
Artists can use the observations from this analysis to improve the popularity of their songs. Shorter or highly danceable songs appear to have a better chance of gaining popularity. Even the title of a song might affect its popularity: artists can try including common words like “Love” or “Like”, which we found in many popular song titles. Those words may help a song get featured in popular playlists.
Music distributors could focus more on the genres that are popular among the current generation of Spotify users. The R&B genre also appears to be gaining popularity over the years, so distributors could collaborate with R&B artists on more releases. More playlists of danceable songs could also be added, given the popularity of such songs.
Users can use our Song Recommendation Engine to get recommendations according to their preferences.

Limitations

Even though Spotify features over 50 million songs, we performed our analysis on a dataset of around 32k records. Using a dynamic dataset could improve the results of the analysis.
Additional attributes could also help the analysis, such as the number of times a particular song has been played or the most downloaded playlists.
The dataset does not include any demographic attributes. The popularity of songs can be affected by the demographics of the listeners, and people in different countries might have different music tastes. Demographic data could provide more insights.
We have tried a multinomial logistic regression model here. Clustering or a neural network could also be tried to develop a better model.
We have not considered multicollinearity among the variables while developing the model, as the correlation is not that high. If we work with a much larger dataset and find considerable collinearity between variables, we can take the multicollinearity effect into account and try to remove it while building the model.
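One common way to check for multicollinearity is the variance inflation factor (VIF): regress each predictor on the others and compute 1 / (1 - R²). The sketch below does this in base R on simulated predictors, not on the Spotify data. The variable names are borrowed from the dataset, but the values are generated so that loudness deliberately tracks energy.

```r
set.seed(1)

# Simulated predictors (not the Spotify data): loudness is built to track energy.
n        <- 200
energy   <- runif(n)
loudness <- -20 + 15 * energy + rnorm(n, sd = 1)
valence  <- runif(n)
X <- data.frame(energy, loudness, valence)

# VIF for each predictor: 1 / (1 - R^2) from regressing it on the others.
vif <- sapply(names(X), function(v) {
  f  <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})
round(vif, 2)
```

In this simulated example, energy and loudness get large VIF values while valence stays close to 1. Running the same check on the actual model predictors would show whether the energy and loudness relationship seen in the correlation plot is strong enough to matter.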