Spotify Data Analysis

Introduction

Background

With over 320 million monthly users and a catalog of 60 million tracks, four billion playlists and 1.9 million podcasts, Spotify is one of the most popular streaming platforms for music and, increasingly, talk content. Like its big tech rivals and partners, Spotify owes much of its success to data and analytics. By collecting and analyzing massive amounts of listener data, Spotify can identify emerging user trends in real time and rapidly develop new features or services to capitalize on them. One of Spotify’s major competitive advantages is its formidable recommendation engine. Using machine learning (ML) algorithms, natural language processing (NLP) and convolutional neural networks (CNNs), Spotify transforms historical listening data into personalized playlists and music recommendations.

For this project, we are interested in how track popularity is influenced by other attributes such as danceability, loudness, speechiness and valence.

Analytical Methodology

The plan is to analyze the relationship between popularity and the other features of a song, to perform cluster analysis with the K-means method to get an idea of song genres, and to use a random forest to predict song popularity.

Benefit of Analysis

The analysis is mainly useful for marketing to Spotify users and improving their experience on the platform. It can also help predict the popularity of a new song from its attributes before the song is released.

Packages Required

The following packages were used:

tidyverse - Provides functionality to model, transform, and visualize data
ggplot2 - For plotting charts
plotly - For interactive, web-based charts via the open-source JavaScript graphing library plotly.js
corrplot - For displaying correlation matrices and confidence intervals
factoextra - For visualizing the output of multivariate data analysis, including clustering
funModeling - Toolbox for exploratory data analysis and data preparation
RColorBrewer - For choosing sensible colour schemes for figures
lubridate - Eases working with date and time data types
knitr - Integrates R code into R Markdown; used here to display variables in a neat, scrollable tabular format
DT - Renders R data objects as HTML tables
cowplot - Provides additional functionality for ggplot2
vtable - For printing summary statistics of the data
cluster - For clustering algorithms
purrr - Functional programming tools for working with functions and vectors
randomForest - For fitting random forest models

library(tidyverse)
library(ggplot2)
library(plotly)
library(corrplot)
library(factoextra)
library(knitr)
library(RColorBrewer)
library(funModeling)
library(lubridate)
library(DT)
library(cowplot)
library(vtable)
library(cluster)    
library(purrr)
library(randomForest)

Data Dictionary

Description of attributes: each row represents one song, and each column contains an attribute of that song. The attributes are as follows:

track_id: Unique ID of the track
track_name: Title / Name of the song
track_artist: Name of the artist
track_popularity: Popularity of the track on a scale from 0 to 100, based on its number of plays
track_album_release_date: Information about the release date of the song
track_album_name: Name of the album the song belongs to
playlist_name: Name of the playlist that contains the song
playlist_genre: Genre of the playlist that contains the song
acousticness: Measure of how acoustic the track is, ranging from 0.0 to 1.0
danceability: Describes how suitable a track is for dancing. Values range from 0.0 being least danceable and 1.0 being most danceable.
duration_ms: Duration of the track in milliseconds (ms); converted to minutes during data preparation
energy: Perceptual measure of intensity and activity, i.e. the energy of the song, ranging from 0.0 to 1.0
instrumentalness: Predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Values range from 0.0 to 1.0, and values closer to 1.0 indicate a greater likelihood that the track is instrumental.
speechiness: Detects the presence of spoken words in a track. Values above roughly 0.6 suggest mostly spoken content such as a podcast or talk show, values between roughly 0.3 and 0.6 suggest a mix of music and speech, and values below roughly 0.3 suggest mostly music.
valence: Measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive, while tracks with low valence sound more negative.
key: Estimated overall key of the track. If key is not detected, the value is -1.
liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
loudness: Overall loudness of a track in decibels (dB). Values typically range between -60 and 0 dB.
mode: Mode indicates the modality (major or minor) of a track. Major is represented by 1 and minor is represented by 0.
tempo: Overall estimated tempo of a track in beats per minute (BPM).

Data Preparation

This section contains all the procedures we followed in preparing the data for analysis. Each step is explained alongside its code.

Data Source

The dataset used for this project is the Spotify Genre dataset provided in the course curriculum. It is loaded below from the TidyTuesday repository on GitHub.

Data Loading

#### Reading Data 
spotify_songs <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')

### Checking dimension of Data
dim(spotify_songs)

## [1] 32833    23

The original dataset has 32,833 rows and 23 columns. The songs were collected from every genre using an interesting visualization of the Spotify genre space maintained by a genre taxonomist. The dataset includes about 5,000 songs for each genre, split across various sub-genres. The main purpose of the original dataset was to explore the following audio features:

Confidence Measures: Acousticness, Liveness, Speechiness, Instrumentalness
Perceptual Measures: Energy, Loudness, Danceability and Valence
Descriptors: Duration, Tempo, Key and Mode

The dataset consists of the following variables:

#### Checking column name
names(spotify_songs)

##  [1] "track_id"                 "track_name"              
##  [3] "track_artist"             "track_popularity"        
##  [5] "track_album_id"           "track_album_name"        
##  [7] "track_album_release_date" "playlist_name"           
##  [9] "playlist_id"              "playlist_genre"          
## [11] "playlist_subgenre"        "danceability"            
## [13] "energy"                   "key"                     
## [15] "loudness"                 "mode"                    
## [17] "speechiness"              "acousticness"            
## [19] "instrumentalness"         "liveness"                
## [21] "valence"                  "tempo"                   
## [23] "duration_ms"

Data Cleaning

Step 1: Handling Missing and Empty Values

#### Counting NA values in every column
colSums(is.na(spotify_songs))

##                 track_id               track_name             track_artist 
##                        0                        5                        5 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        5 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0

#### Removing NA values from the data 
spotify_songs <- na.omit(spotify_songs)

The track_name, track_album_name and track_artist variables each contain 5 missing values. We removed the affected rows because they would hamper the analysis. Only 5 rows were omitted, so the impact on the insights derived from the dataset is negligible.

Step 2: Checking the structure and changing datatypes of certain variables

#### checking Structure of the data 
str(spotify_songs)

## tibble [32,828 x 23] (S3: tbl_df/tbl/data.frame)
##  $ track_id                : chr [1:32828] "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
##  $ track_name              : chr [1:32828] "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr [1:32828] "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : num [1:32828] 66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : chr [1:32828] "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
##  $ track_album_name        : chr [1:32828] "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr [1:32828] "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
##  $ playlist_name           : chr [1:32828] "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_id             : chr [1:32828] "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
##  $ playlist_genre          : chr [1:32828] "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr [1:32828] "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num [1:32828] 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num [1:32828] 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : num [1:32828] 6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num [1:32828] -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : num [1:32828] 1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num [1:32828] 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num [1:32828] 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num [1:32828] 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num [1:32828] 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num [1:32828] 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num [1:32828] 122 100 124 122 124 ...
##  $ duration_ms             : num [1:32828] 194754 162600 176616 169093 189052 ...
##  - attr(*, "na.action")= 'omit' Named int [1:5] 8152 9283 9284 19569 19812
##   ..- attr(*, "names")= chr [1:5] "8152" "9283" "9284" "19569" ...

#### checking Summary of the data 
summary(spotify_songs)

##    track_id          track_name        track_artist       track_popularity
##  Length:32828       Length:32828       Length:32828       Min.   :  0.00  
##  Class :character   Class :character   Class :character   1st Qu.: 24.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 45.00  
##                                                           Mean   : 42.48  
##                                                           3rd Qu.: 62.00  
##                                                           Max.   :100.00  
##  track_album_id     track_album_name   track_album_release_date
##  Length:32828       Length:32828       Length:32828            
##  Class :character   Class :character   Class :character        
##  Mode  :character   Mode  :character   Mode  :character        
##                                                                
##                                                                
##                                                                
##  playlist_name      playlist_id        playlist_genre     playlist_subgenre 
##  Length:32828       Length:32828       Length:32828       Length:32828      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   danceability        energy              key            loudness      
##  Min.   :0.0000   Min.   :0.000175   Min.   : 0.000   Min.   :-46.448  
##  1st Qu.:0.5630   1st Qu.:0.581000   1st Qu.: 2.000   1st Qu.: -8.171  
##  Median :0.6720   Median :0.721000   Median : 6.000   Median : -6.166  
##  Mean   :0.6549   Mean   :0.698603   Mean   : 5.374   Mean   : -6.720  
##  3rd Qu.:0.7610   3rd Qu.:0.840000   3rd Qu.: 9.000   3rd Qu.: -4.645  
##  Max.   :0.9830   Max.   :1.000000   Max.   :11.000   Max.   :  1.275  
##       mode         speechiness      acousticness    instrumentalness   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000000  
##  1st Qu.:0.0000   1st Qu.:0.0410   1st Qu.:0.0151   1st Qu.:0.0000000  
##  Median :1.0000   Median :0.0625   Median :0.0804   Median :0.0000161  
##  Mean   :0.5657   Mean   :0.1071   Mean   :0.1754   Mean   :0.0847599  
##  3rd Qu.:1.0000   3rd Qu.:0.1320   3rd Qu.:0.2550   3rd Qu.:0.0048300  
##  Max.   :1.0000   Max.   :0.9180   Max.   :0.9940   Max.   :0.9940000  
##     liveness         valence           tempo         duration_ms    
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Min.   :  4000  
##  1st Qu.:0.0927   1st Qu.:0.3310   1st Qu.: 99.96   1st Qu.:187805  
##  Median :0.1270   Median :0.5120   Median :121.98   Median :216000  
##  Mean   :0.1902   Mean   :0.5106   Mean   :120.88   Mean   :225797  
##  3rd Qu.:0.2480   3rd Qu.:0.6930   3rd Qu.:133.92   3rd Qu.:253581  
##  Max.   :0.9960   Max.   :0.9910   Max.   :239.44   Max.   :517810

#### Changing datatype of some columns
spotify_songs <- spotify_songs %>%
  mutate(playlist_genre = as.factor(playlist_genre),
         playlist_subgenre = as.factor(playlist_subgenre),
         mode = as.factor(mode),
         key = as.factor(key))

Step 3: Removing Duplicates

#### removing duplicated data 
spotify_songs <- spotify_songs[!duplicated(spotify_songs$track_id),]
dim(spotify_songs)

## [1] 28352    23
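
Because duplicated() flags every occurrence after the first, a track that appears on several playlists keeps only the playlist and genre of its first row. The sketch below illustrates this on a small made-up data frame; the `toy` rows are invented for illustration and are not taken from the dataset.

```r
# Hypothetical toy data: track "a" appears on two playlists with different genres.
toy <- data.frame(
  track_id       = c("a", "b", "a", "c"),
  playlist_genre = c("pop", "rock", "edm", "rap"),
  stringsAsFactors = FALSE
)

# duplicated() marks the second and later occurrences of each track_id,
# so negating it keeps the first row seen for every track.
deduped <- toy[!duplicated(toy$track_id), ]

print(deduped)
```

In this toy case, track "a" is retained as a pop song and its EDM row is dropped, so genre counts after deduplication depend on row order.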

Step 4: Extracting Year and Song Duration in minutes.

We aim to analyze how the data trends over the years in which tracks were released. We therefore split track_album_release_date into year, month and day.

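#### Extracting year, month and day from the release date
spotify_songs <- spotify_songs %>%
  separate(track_album_release_date,
           c("year", "month", "day"),
           sep = "-",
           fill = "right")  # dates without a month or day leave those columns as NA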

#### Creating minutes from duration
spotify_songs<-spotify_songs %>% 
  mutate(duration_min=duration_ms/60000)

#### changing data type of year column
spotify_songs$year <- as.numeric(spotify_songs$year)
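
Release dates are stored as character strings, and it is an assumption here that some of them may carry only a year or a year and month. A minimal base R sketch of how the split behaves in that case, using made-up date strings:

```r
# Hypothetical release-date strings with mixed precision
# (full date, year only, year and month).
release_date <- c("2019-06-14", "2012", "1998-07")

# The year is always the first four characters, whatever the precision.
year <- as.numeric(substr(release_date, 1, 4))

# Splitting on "-" gives a variable number of pieces; missing month/day stay NA.
parts <- strsplit(release_date, "-", fixed = TRUE)
month <- as.numeric(sapply(parts, function(p) p[2]))
day   <- as.numeric(sapply(parts, function(p) p[3]))

print(data.frame(release_date, year, month, day))
```

Since the trend analysis only uses the year, incomplete dates do not affect it, but any analysis by month or day would need to handle the resulting NA values.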

Step 5: Selecting the required columns from the dataset

#### Dropping unnecessary columns
spotify_songs <- spotify_songs %>% select(-c(track_id,track_album_id,playlist_id))

Data Preview

A preview of the clean dataset is given below:

### displaying top 100 rows
output_data <- head(spotify_songs, n = 100)
datatable(output_data, filter = 'top', options = list(pageLength = 25))

Exploratory Data Analysis

Correlation between variables

To understand the correlation among variables, we use the corrplot function in R, which visualizes a correlation matrix.

### Correlation plot of numeric columns
songs_corr <- spotify_songs %>%
select(track_popularity,danceability,energy,loudness,speechiness,acousticness,instrumentalness, liveness, valence, tempo)
par(bg="#121212")
corrplot(cor(songs_corr),method = 'pie',type="lower",bg="#121212",col="#1DB954",tl.col="#1DB954",addgrid.col = "#1DB954")

Based on the plot, popularity is not strongly correlated with any other track feature. However, several features are strongly correlated with each other, which indicates multicollinearity, so they may not all be suitable as joint inputs to classification algorithms.
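
Strongly correlated pairs can also be listed numerically rather than read off the plot. The following is a sketch on simulated data, not on the Spotify features: the column values are made up, `loudness` is deliberately constructed to follow `energy`, and the 0.6 cutoff is an illustrative choice.

```r
# Simulated stand-in for the numeric audio features (values are made up).
set.seed(1)
n <- 200
energy <- runif(n)
features <- data.frame(
  energy       = energy,
  loudness     = -20 + 18 * energy + rnorm(n, sd = 1),  # built to track energy
  acousticness = runif(n)
)

corr <- cor(features)

# Keep each pair once (upper triangle) and flag |r| above a chosen cutoff.
cutoff <- 0.6  # illustrative threshold, not a value from the report
idx <- which(upper.tri(corr) & abs(corr) > cutoff, arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(corr)[idx[, "row"]],
  var2 = colnames(corr)[idx[, "col"]],
  r    = round(corr[idx], 2)
)
print(strong_pairs)
```

Applied to songs_corr, the same pattern would give a short table of feature pairs to review before modelling.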

Density Plots of Variables

Let’s see how energy, danceability, valence, acousticness, speechiness and liveness are distributed over all the observations in our dataset. We plot the densities of all six variables together because they share the same scale, ranging from 0 to 1.

#### Plotting Density Plots
ggplot(spotify_songs) +
  geom_density(aes(energy, fill = "energy"), alpha = 0.1) + 
  geom_density(aes(danceability, fill = "danceability"), alpha = 0.1) + 
  geom_density(aes(valence, fill = "valence"), alpha = 0.1) + 
  geom_density(aes(acousticness, fill = "acousticness"), alpha = 0.1) + 
  geom_density(aes(speechiness, fill = "speechiness"), alpha = 0.1) + 
  geom_density(aes(liveness, fill = "liveness"), alpha = 0.1) + 
  scale_x_continuous(name = "Energy, Danceability, Valence, Acousticness, Speechiness and Liveness") +
  scale_y_continuous(name = "Density") +
  ggtitle("Density plot of Energy, Danceability, Valence, Acousticness, Speechiness and Liveness") +
  theme_bw() +
  theme(plot.title = element_text(size = 10, face = "bold", colour = "#1DB954"),
        text = element_text(size = 10,colour = "#1DB954")) +
  theme(legend.title=element_blank()) +
  scale_fill_brewer(palette="Accent") + 
  theme(panel.background = element_rect(fill = "#121212")) +
  theme(plot.background = element_rect(fill = "#121212")) +
  theme(legend.background = element_rect(fill = "#121212"))+
  theme(axis.text.x = element_text(colour = "darkgreen"))+
  theme(axis.text.y = element_text(colour = "darkgreen"))

Box Plot

Genre by energy

#### Plotting Box Plot of genre by energy
ggplot(spotify_songs, aes(x=energy, y=playlist_genre)) +
  geom_boxplot(color="white", fill="darkgreen")  +
  scale_x_continuous(name = "Energy") +
  scale_y_discrete(name = "Genre") +
  theme_bw() +
  ggtitle("Variation: Energy and Genre") +
  
  theme(plot.title = element_text(size = 10, face = "bold", colour = "#1DB954"),
        text = element_text(size = 10,colour = "#1DB954")) +
  theme(legend.title=element_blank()) +
  scale_fill_brewer(palette="Accent") + 
  theme(axis.text.x = element_text(colour = "darkgreen"))+
  theme(panel.background = element_rect(fill = "#121212"), 
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  theme(plot.background = element_rect(fill = "#121212")) +
  theme(legend.background = element_rect(fill = "#121212"))+
  theme(axis.text.y = element_text(colour = "darkgreen"))

The plot shows that the EDM genre has the highest-energy songs.

Genre by danceability

#### Plotting Box Plot of genre by danceability
ggplot(spotify_songs, aes(x=danceability, y=playlist_genre)) +
  geom_boxplot(color="white", fill="darkgreen")  +
  scale_x_continuous(name = "Danceability") +
  scale_y_discrete(name = "Genre") +
  theme_bw() +
  ggtitle("Danceability and Genre") +
  
  theme(plot.title = element_text(size = 10, face = "bold", colour = "#1DB954"),
        text = element_text(size = 10,colour = "#1DB954")) +
  theme(legend.title=element_blank()) +
  scale_fill_brewer(palette="Accent") + 
  theme(axis.text.x = element_text(colour = "darkgreen"))+
  theme(panel.background = element_rect(fill = "#121212"), 
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  theme(plot.background = element_rect(fill = "#121212")) +
  theme(legend.background = element_rect(fill = "#121212"))+
  theme(axis.text.y = element_text(colour = "darkgreen"))

As seen in the graph, the rap genre has the highest danceability.

Genre by liveness

#### Plotting Box Plot of genre by liveness
ggplot(spotify_songs, aes(x=liveness, y=playlist_genre)) +
  geom_boxplot(color="white", fill="darkgreen")  +
  scale_x_continuous(name = "Liveness") +
  scale_y_discrete(name = "Genre") +
  theme_bw() +
  ggtitle("Liveness and Genre") +
  
  theme(plot.title = element_text(size = 10, face = "bold", colour = "#1DB954"),
        text = element_text(size = 10,colour = "#1DB954")) +
  theme(legend.title=element_blank()) +
  scale_fill_brewer(palette="Accent") + 
  theme(axis.text.x = element_text(colour = "darkgreen"))+
  theme(panel.background = element_rect(fill = "#121212"), 
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  theme(plot.background = element_rect(fill = "#121212")) +
  theme(legend.background = element_rect(fill = "#121212"))+
  theme(axis.text.y = element_text(colour = "darkgreen"))

EDM songs appear to have the highest liveness, followed closely by the rock genre.

Genre by valence

#### Plotting Box Plot of genre by valence
ggplot(spotify_songs, aes(x=valence, y=playlist_genre)) +
  geom_boxplot(color="white", fill="darkgreen")  +
  scale_x_continuous(name = "Valence") +
  scale_y_discrete(name = "Genre") +
  theme_bw() +
  ggtitle("Valence and Genre") +
  
  theme(plot.title = element_text(size = 10, face = "bold", colour = "#1DB954"),
        text = element_text(size = 10,colour = "#1DB954")) +
  theme(legend.title=element_blank()) +
  scale_fill_brewer(palette="Accent") + 
  theme(axis.text.x = element_text(colour = "darkgreen"))+
  theme(panel.background = element_rect(fill = "#121212"), 
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  theme(plot.background = element_rect(fill = "#121212")) +
  theme(legend.background = element_rect(fill = "#121212"))+
  theme(axis.text.y = element_text(colour = "darkgreen"))

As seen above, the Latin genre has a higher valence than the others.

Genre by loudness

#### Plotting Box Plot of genre by loudness
ggplot(spotify_songs, aes(x=loudness, y=playlist_genre)) +
  geom_boxplot(color="white", fill="darkgreen")  +
  scale_x_continuous(name = "Loudness") +
  scale_y_discrete(name = "Genre") +
  theme_bw() +
  ggtitle("Loudness and Genre") +
  
  theme(plot.title = element_text(size = 10, face = "bold", colour = "#1DB954"),
        text = element_text(size = 10,colour = "#1DB954")) +
  theme(legend.title=element_blank()) +
  scale_fill_brewer(palette="Accent") + 
  theme(axis.text.x = element_text(colour = "darkgreen"))+
  theme(panel.background = element_rect(fill = "#121212"), 
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  theme(plot.background = element_rect(fill = "#121212")) +
  theme(legend.background = element_rect(fill = "#121212"))+
  theme(axis.text.y = element_text(colour = "darkgreen"))

Loudness is fairly similar across genres; only EDM songs are slightly louder than the rest.

Energy Distribution of the songs

#### Histogram of Energy Distribution
spotify_songs$energy_only <- cut(spotify_songs$energy, breaks = 10)
spotify_songs %>%
  ggplot( aes(x = energy_only )) +
  geom_bar(width = 0.2, fill = "#1DB954", colour = "black") +
  scale_x_discrete(name = "Energy") +
    theme(plot.title = element_text(size = 10, face = "bold", colour = "#1DB954"),
       text = element_text(size = 10,colour = "darkgreen")) +
  theme(axis.text.x = element_text(colour = "#1DB954"))+
  theme(panel.background = element_rect(fill = "#121212"), 
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  theme(plot.background = element_rect(fill = "#121212")) +
  theme(legend.background = element_rect(fill = "#121212"))+
  theme(axis.text.y = element_text(colour = "darkgreen"))

This plot shows that most songs in the dataset have high energy: the bins between 0.6 and 0.9 hold the largest share of tracks.

Speechiness Distribution of the songs

spotify_songs$speech_only <- cut(spotify_songs$speechiness, breaks = 10)
spotify_songs %>%
  ggplot(aes(x = speech_only)) +
  geom_bar(width = 0.2,  fill = "#1DB954", colour = "black") +
  scale_x_discrete(name = "Speechiness") +
    theme(axis.text.x = element_text(colour = "darkgreen"))+
    theme(plot.title = element_text(size = 10, face = "bold", colour = "#1DB954"),
       text = element_text(size = 10,colour = "#1DB954")) +
  theme(panel.background = element_rect(fill = "#121212"), 
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  theme(plot.background = element_rect(fill = "#121212")) +
  theme(legend.background = element_rect(fill = "#121212"))+
  theme(axis.text.y = element_text(colour = "darkgreen"))+
  coord_flip()

This plot shows that most songs in the dataset have low speechiness: nearly two thirds fall in the lowest bin.

Trend Analysis by Year

trend_chart <- function(arg){
  # Average of the chosen feature per release year, for tracks released after 2010
  trend_change <- spotify_songs %>%
    filter(year > 2010) %>%
    group_by(year) %>%
    summarize(Average = mean(.data[[arg]]))

  chart <- ggplot(data = trend_change, aes(x = year, y = Average)) +
    geom_line(color = "#1DB954", size = 1) +
    scale_x_continuous(breaks = seq(2011, 2020, 1)) +
    scale_y_continuous(name = arg) +
    theme(axis.text.x = element_text(colour = "darkgreen")) +
    theme(plot.title = element_text(size = 10, face = "bold", colour = "#1DB954"),
          text = element_text(size = 10, colour = "#1DB954")) +
    theme(panel.background = element_rect(fill = "#121212"),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank()) +
    theme(plot.background = element_rect(fill = "#121212")) +
    theme(legend.background = element_rect(fill = "#121212")) +
    theme(axis.text.y = element_text(colour = "darkgreen"))
  return(chart)
}

trend_chart_track_popularity<-trend_chart("track_popularity")
trend_chart_danceability<-trend_chart("danceability")
trend_chart_energy<-trend_chart("energy")
trend_chart_loudness<-trend_chart("loudness")
trend_chart_duration_min<-trend_chart("duration_min")
trend_chart_speechiness<-trend_chart("speechiness")



plot_grid(trend_chart_track_popularity, trend_chart_danceability, trend_chart_energy, trend_chart_loudness, trend_chart_duration_min, trend_chart_speechiness,ncol = 2, label_size = 1)

To see how the features change over time, we group the songs by release year, compute the average of each feature per year, and plot it for tracks released after 2010.
What interests us most is that track duration shows a continuously decreasing trend, i.e. songs are getting shorter each year.
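
One way to put a number on such a trend is to fit a straight line through the yearly averages and read off the slope. This is a sketch under the assumption that a linear fit is an adequate summary; the durations below are made up for illustration.

```r
# Made-up track durations (minutes) by release year, for illustration only.
toy <- data.frame(
  year         = c(2011, 2011, 2012, 2012, 2013, 2013),
  duration_min = c(4.0, 3.8, 3.7, 3.5, 3.4, 3.2)
)

# Average duration per year, the same grouping the trend charts rely on.
yearly <- aggregate(duration_min ~ year, data = toy, FUN = mean)

# Slope of a straight-line fit: negative means tracks get shorter over time.
slope <- coef(lm(duration_min ~ year, data = yearly))[["year"]]

print(yearly)
print(slope)
```

A negative slope expresses the change in average duration per year, which is easier to compare across features than the shape of a line chart.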

Summary Statistics of the clean data

### Summary statistics of all the variables available in the data
st(spotify_songs)

Summary Statistics
Variable	N	Mean	Std. Dev.	Min	Pctl. 25	Pctl. 75	Max
track_popularity	28352	39.335	23.699	0	21	58	100
year	28352	2011.054	11.23	1957	2008	2019	2020
playlist_genre	28352
… edm	4877	17.2%
… latin	4136	14.6%
… pop	5132	18.1%
… r&b	4504	15.9%
… rap	5398	19%
… rock	4305	15.2%
playlist_subgenre	28352
… album rock	1039	3.7%
… big room	1034	3.6%
… classic rock	1100	3.9%
… dance pop	1298	4.6%
… electro house	1416	5%
… electropop	1251	4.4%
… gangster rap	1314	4.6%
… hard rock	1202	4.2%
… hip hop	1296	4.6%
… hip pop	803	2.8%
… indie poptimism	1547	5.5%
… latin hip hop	1194	4.2%
… latin pop	1097	3.9%
… neo soul	1478	5.2%
… new jack swing	1036	3.7%
… permanent wave	964	3.4%
… pop edm	967	3.4%
… post-teen pop	1036	3.7%
… progressive electro house	1460	5.1%
… reggaeton	687	2.4%
… southern hip hop	1582	5.6%
… trap	1206	4.3%
… tropical	1158	4.1%
… urban contemporary	1187	4.2%
danceability	28352	0.653	0.146	0	0.561	0.76	0.983
energy	28352	0.698	0.184	0	0.579	0.843	1
key	28352
… 0	3001	10.6%
… 1	3436	12.1%
… 2	2478	8.7%
… 3	797	2.8%
… 4	1925	6.8%
… 5	2301	8.1%
… 6	2261	8%
… 7	2907	10.3%
… 8	2066	7.3%
… 9	2631	9.3%
… 10	1972	7%
… 11	2577	9.1%
loudness	28352	-6.818	3.036	-46.448	-8.31	-4.709	1.275
mode	28352
… 0	12318	43.4%
… 1	16034	56.6%
speechiness	28352	0.108	0.103	0	0.041	0.133	0.918
acousticness	28352	0.177	0.223	0	0.014	0.26	0.994
instrumentalness	28352	0.091	0.233	0	0	0.007	0.994
liveness	28352	0.191	0.156	0	0.093	0.249	0.996
valence	28352	0.51	0.234	0	0.329	0.695	0.991
tempo	28352	120.958	26.955	0	99.972	133.999	239.44
duration_ms	28352	226574.631	61081.364	4000	187741.25	254975.25	517810
duration_min	28352	3.776	1.018	0.067	3.129	4.25	8.63
energy_only	28352
… (-0.000825,0.1]	67	0.2%
… (0.1,0.2]	225	0.8%
… (0.2,0.3]	518	1.8%
… (0.3,0.4]	1168	4.1%
… (0.4,0.5]	2368	8.4%
… (0.5,0.6]	3627	12.8%
… (0.6,0.7]	4964	17.5%
… (0.7,0.8]	5797	20.4%
… (0.8,0.9]	5763	20.3%
… (0.9,1]	3855	13.6%
speech_only	28352
… (-0.000918,0.0918]	18385	64.8%
… (0.0918,0.184]	4869	17.2%
… (0.184,0.275]	2459	8.7%
… (0.275,0.367]	1648	5.8%
… (0.367,0.459]	736	2.6%
… (0.459,0.551]	174	0.6%
… (0.551,0.643]	53	0.2%
… (0.643,0.734]	12	0%
… (0.734,0.826]	8	0%
… (0.826,0.919]	8	0%

Random Forest

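In this section, we fit a random forest to the Spotify dataset to predict track popularity. Songs with a popularity above 62 are labelled popular and the rest non-popular. The cutoff of 62 is the third quartile of track_popularity before duplicates were removed; in the deduplicated data the 75th percentile is 58, so fewer than 25% of songs are labelled popular.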

# Selecting the relevant data
set.seed(2021)

part<-sample(1:nrow(spotify_songs), nrow(spotify_songs)*.75)

data_for_popularity_analysis <- spotify_songs %>% 
  select(c('energy', 'liveness','tempo', 'speechiness', 'acousticness',
           'instrumentalness', 'danceability', 'duration_min' ,
           'loudness','valence' ,'track_popularity','key','mode','playlist_genre')) %>%
  mutate( track_popularity = if_else(track_popularity > 62 , 1,0))  
                  
#Splitting data in train and Test
spotify_songs_train<- data_for_popularity_analysis[part,]
spotify_songs_test <- data_for_popularity_analysis[-part,]
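
The popularity cutoff could also be derived from the rows actually being modelled instead of being typed in as a constant. A minimal sketch, assuming a top-quartile definition of "popular" and using made-up scores in place of track_popularity:

```r
# Made-up popularity scores standing in for track_popularity.
popularity <- c(0, 10, 21, 35, 40, 47, 58, 63, 71, 90)

# 75th percentile of the scores actually being modelled.
cutoff <- quantile(popularity, probs = 0.75, names = FALSE)

# 1 = popular (strictly above the cutoff), 0 = not popular.
is_popular <- as.integer(popularity > cutoff)

print(cutoff)
print(mean(is_popular))  # share of songs labelled popular
```

Computing the cutoff this way keeps the label definition in step with the data after cleaning steps such as deduplication.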

# running random Forest 
spotify_rand <- randomForest(as.factor(track_popularity)~., data=spotify_songs_train, mtry= 4,importance =TRUE )
spotify_rand

## 
## Call:
##  randomForest(formula = as.factor(track_popularity) ~ ., data = spotify_songs_train,      mtry = 4, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 18.75%
## Confusion matrix:
##       0   1 class.error
## 0 16947 445  0.02558648
## 1  3543 329  0.91503099

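The OOB estimate of the error rate is 18.75%, i.e. an estimated accuracy of 81.25% on out-of-bag samples. The class errors differ sharply, though: about 2.6% for non-popular songs versus about 91.5% for popular songs.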
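
The OOB figures can be reproduced by hand from the confusion matrix in the printout. The sketch below re-enters the four counts manually rather than reading them from the fitted model.

```r
# OOB confusion matrix copied from the randomForest printout (rows = actual class).
oob <- matrix(c(16947, 445,
                 3543, 329),
              nrow = 2, byrow = TRUE,
              dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

oob_error    <- 1 - sum(diag(oob)) / sum(oob)  # overall OOB error rate
class_error  <- 1 - diag(oob) / rowSums(oob)   # per-class error, as in class.error
oob_accuracy <- 1 - oob_error

print(round(100 * oob_error, 2))
print(round(class_error, 4))
print(round(100 * oob_accuracy, 2))
```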

Analysis of Random Forest Model

Calculating Confusion Matrix

#Predicting songs popularity on test data
predict_test <- predict(spotify_rand, spotify_songs_test[,-11])
table(spotify_songs_test$track_popularity, predict_test)

##    predict_test
##        0    1
##   0 5687  123
##   1 1172  106
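
Overall accuracy hides how differently the two classes are handled, so it helps to also derive precision, recall and a majority-class baseline from the test confusion matrix. The sketch below re-enters the counts from the table above by hand; the metric names are standard definitions, not outputs of the original analysis.

```r
# Test-set confusion matrix copied from the table() output (rows = actual class).
cm <- matrix(c(5687, 123,
               1172, 106),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

accuracy  <- sum(diag(cm)) / sum(cm)
precision <- cm["1", "1"] / sum(cm[, "1"])  # of songs predicted popular, share truly popular
recall    <- cm["1", "1"] / sum(cm["1", ])  # of truly popular songs, share found
baseline  <- max(rowSums(cm)) / sum(cm)     # accuracy of always predicting the majority class

print(round(c(accuracy = accuracy, precision = precision,
              recall = recall, baseline = baseline), 3))
```

With these counts the accuracy is about 81.7% while recall for popular songs is about 8.3%, and always predicting "not popular" would score about 82.0%, so the model adds little for the class of interest.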

Plotting Variable Importance

###Calculating variable importance (mean decrease in accuracy)
important <- importance(spotify_rand)
varImportance <- data.frame(Variables = row.names(important),
                           Importance = round(important[, "MeanDecreaseAccuracy"], 2))
rankImportance <- varImportance%>%
      mutate(Rank= paste('#',dense_rank(desc(Importance))))

ggplot(rankImportance,aes(x=reorder(Variables,Importance) ,y=Importance,fill=Importance))+ 
geom_bar(stat = "identity",fill = "#1DB954") +
geom_text(aes(x = Variables, y = 0.5, label = Rank),hjust=0, vjust=0.55, size = 4, colour = "black") +
  theme_bw() +
  ggtitle("Variable Importance") +
    scale_x_discrete(name = "Variables") +
  theme(axis.text.x = element_text(colour = "darkgreen"))+
  theme(plot.title = element_text(size = 10, face = "bold", colour = "#1DB954"),
        text = element_text(size = 10,colour = "#1DB954")) +
  theme(panel.background = element_rect(fill = "#121212"), 
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  theme(plot.background = element_rect(fill = "#121212")) +
  theme(legend.background = element_rect(fill = "#121212"))+
  theme(axis.text.y = element_text(colour = "darkgreen")) +
coord_flip()

Conclusion

A common assumption is that energy influences popularity, i.e. that energetic songs are more popular. However, we could not find any correlation between popularity and energy. The songs in the top 100 were also not evenly distributed across genres.

We split the track_album_release_date variable into year, month and day, and created new variables such as duration_min, energy_only and speech_only. Our goal was to explain track popularity using different features of the song. We used the trend charts trend_chart_track_popularity, trend_chart_danceability, trend_chart_energy, trend_chart_loudness, trend_chart_duration_min and trend_chart_speechiness to find out the trends.

What interests us most is that track duration shows a continuously decreasing trend, meaning that songs are getting shorter each year. Furthermore, the danceability of tracks is continuously rising, which suggests that people are enjoying danceable songs. Energy and loudness follow almost the same trend each year, showing a high positive correlation between them: both features peak and dip in the same years.

The average popularity of songs reached its minimum of the last decade in 2014 and has been increasing continuously since, indicating that songs are, on average, becoming more popular over time.

We used trend charts to find out how the features change over time, the corrplot function in R to understand the correlation among variables, and boxplots to compare feature distributions across genres and spot outliers.

Even though millions of songs exist, we had only about 32k records for our analysis, so we could not obtain a full picture of the features of music. The analysis could also be strengthened by incorporating user-related features such as demographic attributes and listening history. We used the k-means clustering algorithm to recover the respective genres of the songs, but, perhaps because of the data size, we could not obtain clear clusters. We also tried to predict a track’s popularity from its key features using a random forest. The model classifies non-popular songs fairly well but performs poorly on popular songs, misclassifying most of them. Playlist genre and loudness are the two major factors contributing to a song’s popularity score.