Spotify Data Analysis

By,
Sai Mounica Gudimella
Shruthi Saxena
BANA Spring 2021

Introduction

  • Spotify is a digital music, podcast, and video streaming service that gives access to millions of songs and other content from artists all over the world.
  • This dataset consists of songs released from 1957 to 2020, along with characteristics that influence their popularity among the masses.

Problem Statement:

  • The goal of this project is to predict/identify characteristics that would generate a high track popularity -
    • Understand how audio features like danceability, loudness, speechiness determine a song’s popularity
    • Identify the leading genre in terms of count and popularity
    • Identify the top artists and top songs
    • Additionally, we will also look at the kinds of songs Mounica and Shruthi enjoy listening to

Address the problem:

  • To address the above problem statement, we will employ the following techniques -
    • Perform Exploratory Data Analysis:
      • Observe the correlation between various features
      • Visualize the data to learn which features have a positive or negative impact on popularity
    • Generate Insights:
      • Trends of popular features - understand which genre has the highest number of songs, which genre/artist performs best, etc.
      • Variability of characteristics across genres
    • Implement Clustering:
      • Apply the KNN classification and K-means clustering algorithms - these can help in suggesting similar content by identifying patterns in the clusters and predicting the least and most popular clusters

Approach:

  • We will follow the approach below to analyze this dataset -
    • Import - Import the Spotify dataset into RStudio for analysis.
    • Data Cleaning - Next, we store this data in a dataframe in a consistent form: each column is a variable and each row is an observation. In this step, we inspect the data to identify null values and outliers and eliminate them. We also convert variables to their correct data types.
    • Data Transformation - In this step, we remove duplicate records, delete insignificant variables and create new variables.
    • Data Visualization - This step identifies how variables relate to one another, to understand which features to use for prediction in the next step.
    • Model - We model the data to predict/identify characteristics that would generate high track popularity. We use the KNN classification algorithm to analyze the dataset, and K-means to cluster songs into similar groups and recommend a song.

Usefulness:

  • Evaluating preferred music content helps in adding similar content or suggesting similar songs to users, which translates directly into a better user experience
  • From this, we can classify or cluster users to create content similar to the most trending or preferred music
  • From this analysis, we can further predict/recommend the types of songs that Spotify should add in the future to increase its customer base and, in turn, its profit

Packages Required

library(dplyr)
library(corrplot)
library(ggplot2)
library(ggthemes)
library(tidyverse)
library(kableExtra)
library(factoextra)
library(plotly)
library(skimr)
library(DT)
library(class) 
library(tseries) 
library(lmtest) 
library(forecast)
library(TSA)
  • We are using the following packages for the analysis of the Spotify data
    • tidyverse - for interacting with data through subsetting, transformation, visualization, etc.
    • dplyr - for leveraging data manipulation features like filtering, selecting, grouping, ordering, subsetting and transforming data
    • ggplot2 - for creating visualizations like barplot, scatterplot, boxplot etc.
    • corrplot - for visualizing correlation and correlograms
    • ggthemes - this package is used for modifying the themes of the plots
    • kableExtra - This is used for generating lightweight tables using kable()
    • factoextra - This will be used for generating graphs with clusters using kmeans()
    • plotly - This will be used to create interactive web-based graphs
    • skimr - This library is used for creating a table of summary statistics
    • DT - This library is used for generating interactive data tables
    • class - This is used for the KNN classification algorithm
    • tseries - This is used for time series analysis
    • lmtest - This is used for significance tests of fitted model coefficients (coeftest)
    • forecast - This is used for Time Series forecast
    • TSA - This is used for time series analysis

Data Preparation

Data Source Information

Source - https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-21/readme.md

Data Information

The data comes from Spotify via the spotifyr package. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get either your own data or general metadata around songs from Spotify’s API.

Data Dictionary

variable class description
track_id character Song unique ID
track_name character Song Name
track_artist character Song Artist
track_popularity double Song Popularity (0-100) where higher is better
track_album_id character Album unique ID
track_album_name character Song album name
track_album_release_date character Date when album released
playlist_name character Name of playlist
playlist_id character Playlist ID
playlist_genre character Playlist genre
playlist_subgenre character Playlist subgenre
danceability double Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy double Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
key double The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
loudness double The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
mode double Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness double Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
acousticness double A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
instrumentalness double Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness double Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence double A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo double The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
duration_ms double Duration of song in milliseconds

Data Import

raw_data <- read.csv("spotify_songs.csv")
str(raw_data)
## 'data.frame':    32833 obs. of  23 variables:
##  $ track_id                : chr  "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
##  $ track_name              : chr  "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr  "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : int  66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : chr  "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
##  $ track_album_name        : chr  "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr  "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
##  $ playlist_name           : chr  "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_id             : chr  "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
##  $ playlist_genre          : chr  "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr  "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : int  6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : int  1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num  122 100 124 122 124 ...
##  $ duration_ms             : int  194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
  • Spotify dataset that is imported has 32833 rows and 23 columns.
  • It can be seen that the column names are already in snake case.

Data Cleaning

Step 1: Convert the features to an appropriate data type.

raw_data$playlist_genre <- as.factor(raw_data$playlist_genre)
raw_data$playlist_subgenre <- as.factor(raw_data$playlist_subgenre)
raw_data$mode <- as.factor(raw_data$mode)
  • playlist_genre, playlist_subgenre and mode were converted to factors as they are categorical variables.

Step 2: Separate and create new variables.

raw_data <- separate(raw_data, track_album_release_date, c("track_album_release_year","track_album_release_month","track_album_release_day"), fill="right")
  • The track_album_release_date column had missing month and day values for a few rows. Using the separate function, it has been split into 3 columns for year, month and day, with missing parts assigned NA. The NAs in these 2 columns arise because some dates in the original data source list only the year. Imputing these missing values with the median/mode of the respective columns would give entirely wrong information about the release day and month, hence we retain them as NA.

Step 3: Check for acceptable range of values for numeric variables to identify outliers.

numeric_data <- raw_data %>% dplyr::select(where(is.numeric))
summary(numeric_data)
##  track_popularity  danceability        energy              key        
##  Min.   :  0.00   Min.   :0.0000   Min.   :0.000175   Min.   : 0.000  
##  1st Qu.: 24.00   1st Qu.:0.5630   1st Qu.:0.581000   1st Qu.: 2.000  
##  Median : 45.00   Median :0.6720   Median :0.721000   Median : 6.000  
##  Mean   : 42.48   Mean   :0.6548   Mean   :0.698619   Mean   : 5.374  
##  3rd Qu.: 62.00   3rd Qu.:0.7610   3rd Qu.:0.840000   3rd Qu.: 9.000  
##  Max.   :100.00   Max.   :0.9830   Max.   :1.000000   Max.   :11.000  
##     loudness        speechiness      acousticness    instrumentalness   
##  Min.   :-46.448   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000000  
##  1st Qu.: -8.171   1st Qu.:0.0410   1st Qu.:0.0151   1st Qu.:0.0000000  
##  Median : -6.166   Median :0.0625   Median :0.0804   Median :0.0000161  
##  Mean   : -6.720   Mean   :0.1071   Mean   :0.1753   Mean   :0.0847472  
##  3rd Qu.: -4.645   3rd Qu.:0.1320   3rd Qu.:0.2550   3rd Qu.:0.0048300  
##  Max.   :  1.275   Max.   :0.9180   Max.   :0.9940   Max.   :0.9940000  
##     liveness         valence           tempo         duration_ms    
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Min.   :  4000  
##  1st Qu.:0.0927   1st Qu.:0.3310   1st Qu.: 99.96   1st Qu.:187819  
##  Median :0.1270   Median :0.5120   Median :121.98   Median :216000  
##  Mean   :0.1902   Mean   :0.5106   Mean   :120.88   Mean   :225800  
##  3rd Qu.:0.2480   3rd Qu.:0.6930   3rd Qu.:133.92   3rd Qu.:253585  
##  Max.   :0.9960   Max.   :0.9910   Max.   :239.44   Max.   :517810
  • Here, all the numeric columns were checked for their range. All the columns are in range as per the description of the dataset except loudness. The data description states that values typically range between -60 and 0 dB; however, the data above had 6 rows with values greater than 0.
  • Going by the wording of the description, it is not mandatory but merely typical for all values to lie between -60 and 0 dB. Hence, the decision was taken to keep these rows intact.
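  • As a quick check, the number of tracks exceeding the stated 0 dB upper bound can be counted directly (a minimal sketch; the exact count depends on the version of the raw file):
sum(raw_data$loudness > 0, na.rm = TRUE)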
boxplot(numeric_data$tempo)

  • The description makes it clear that tempo is an estimated quantity derived from the average beat duration. Its boxplot shows outliers; however, these values were kept as they are because the description gives clear information about exactly how tempo is estimated.
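  • As a rough check, the number of tempo values that the boxplot flags as outliers (the usual 1.5 x IQR rule) can be counted as below; this is only a sketch and, as noted above, these values were retained:
length(boxplot.stats(numeric_data$tempo)$out)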

Step 4: Check for the number of missing or NA values in the dataframe

colSums(is.na(raw_data))
##                  track_id                track_name              track_artist 
##                         0                         5                         5 
##          track_popularity            track_album_id          track_album_name 
##                         0                         0                         5 
##  track_album_release_year track_album_release_month   track_album_release_day 
##                         0                      1855                      1886 
##             playlist_name               playlist_id            playlist_genre 
##                         0                         0                         0 
##         playlist_subgenre              danceability                    energy 
##                         0                         0                         0 
##                       key                  loudness                      mode 
##                         0                         0                         0 
##               speechiness              acousticness          instrumentalness 
##                         0                         0                         0 
##                  liveness                   valence                     tempo 
##                         0                         0                         0 
##               duration_ms 
##                         0
  • It is seen that there are 5 rows with no track name, track artist, or album name. These were removed from the dataset.

Step 5: Remove duplicate records

duplicate_data <- na.omit(raw_data[raw_data$track_name == "Shape of You",])
duplicate_data
##                     track_id   track_name track_artist track_popularity
## 1219  7qiZfU4dY1lWllzX7mPBI3 Shape of You   Ed Sheeran               86
## 2318  7qiZfU4dY1lWllzX7mPBI3 Shape of You   Ed Sheeran               86
## 2839  7qiZfU4dY1lWllzX7mPBI3 Shape of You   Ed Sheeran               86
## 5308  5ZiGdWZvv9kaWw1UFcmDur Shape of You          EZA                2
## 17489 0FE9t6xYkqWXU2ahLh6D8X Shape of You   Ed Sheeran               75
## 23868 0FE9t6xYkqWXU2ahLh6D8X Shape of You   Ed Sheeran               75
## 23900 7qiZfU4dY1lWllzX7mPBI3 Shape of You   Ed Sheeran               86
## 30407 7qiZfU4dY1lWllzX7mPBI3 Shape of You   Ed Sheeran               86
##               track_album_id track_album_name track_album_release_year
## 1219  3T4tUhGYeRNVUGevb0wThu          ÷ (Deluxe)                     2017
## 2318  3T4tUhGYeRNVUGevb0wThu          ÷ (Deluxe)                     2017
## 2839  3T4tUhGYeRNVUGevb0wThu          ÷ (Deluxe)                     2017
## 5308  5Zm0pvfZXgyVjo8uaCqcan     Shape of You                     2017
## 17489 7oJa8bPFKVbq4c7NswXHw8     Shape of You                     2017
## 23868 7oJa8bPFKVbq4c7NswXHw8     Shape of You                     2017
## 23900 3T4tUhGYeRNVUGevb0wThu          ÷ (Deluxe)                     2017
## 30407 3T4tUhGYeRNVUGevb0wThu          ÷ (Deluxe)                     2017
##       track_album_release_month track_album_release_day
## 1219                         03                      03
## 2318                         03                      03
## 2839                         03                      03
## 5308                         05                      19
## 17489                        01                      06
## 23868                        01                      06
## 23900                        03                      03
## 30407                        03                      03
##                                                                             playlist_name
## 1219                              Pop Inglés (2020 - 2010s) Música En Inglés 2010s
## 2318                                                                        post teen pop
## 2839                                                                      ElectroPop 2020
## 5308                                                                    Music&Other Drugs
## 17489                                                          Tropical House Run 190 BPM
## 23868                                                                Today's Hits (Clean)
## 23900                                                                Today's Hits (Clean)
## 30407 2010 - 2011 - 2012 - 2013 - 2014 - 2015 - 2016 - 2017 - 2018 - 2019 - 2020 TOP HITS
##                  playlist_id playlist_genre playlist_subgenre danceability
## 1219  4QAzO1Z92yqqD6Mf3mOANu            pop         dance pop        0.825
## 2318  6rjxP7GQKoqqgoakzxl3PY            pop     post-teen pop        0.825
## 2839  4frhr6RQM2fMOm2mpvOVo6            pop        electropop        0.825
## 5308  5jROYSZSL7cO0jGAqkPx7C            pop   indie poptimism        0.596
## 17489 37i9dQZF1DWSTc9FdySHtz          latin          tropical        0.825
## 23868 7ENISpOJhocpMJVcGb0qcT            r&b           hip pop        0.825
## 23900 7ENISpOJhocpMJVcGb0qcT            r&b           hip pop        0.825
## 30407 2DjIfVDXGYDgRxw7IJTKVb            edm           pop edm        0.825
##       energy key loudness mode speechiness acousticness instrumentalness
## 1219   0.652   1   -3.183    0      0.0802        0.581         0.000000
## 2318   0.652   1   -3.183    0      0.0802        0.581         0.000000
## 2839   0.652   1   -3.183    0      0.0802        0.581         0.000000
## 5308   0.390  11   -9.708    0      0.2250        0.596         0.000368
## 17489  0.652   1   -3.183    0      0.0802        0.581         0.000000
## 23868  0.652   1   -3.183    0      0.0802        0.581         0.000000
## 23900  0.652   1   -3.183    0      0.0802        0.581         0.000000
## 30407  0.652   1   -3.183    0      0.0802        0.581         0.000000
##       liveness valence   tempo duration_ms
## 1219    0.0931   0.931  95.977      233713
## 2318    0.0931   0.931  95.977      233713
## 2839    0.0931   0.931  95.977      233713
## 5308    0.4870   0.652 176.000      225682
## 17489   0.0931   0.931  95.977      233713
## 23868   0.0931   0.931  95.977      233713
## 23900   0.0931   0.931  95.977      233713
## 30407   0.0931   0.931  95.977      233713
  • We notice above that there are duplicate track_id values, with the values of all other features identical. So these were identified and removed.
  • Next, we observed duplicate values in track_name with all other features identical except for track_popularity and the album. We have removed such duplicate records as well, because we want every track_name to have exactly one associated track_popularity when analyzing its music features. Retaining the duplicate rows would imply the same song having different popularity values but identical feature values for danceability, loudness, etc., which is inconsistent.
# Drop exact duplicates of the same track_id
raw_data <- raw_data[!duplicated(raw_data$track_id),]
# Keep one row per track_name / track_artist combination
raw_data <- raw_data[!duplicated(raw_data[,c("track_name","track_artist")]),]
# Drop the ID columns (track, album, playlist) along with the album and playlist names
raw_data <- subset(raw_data, select = -c(1,5,6,10,11))
# Remove the rows with missing track names identified earlier
clean_data <- raw_data[!is.na(raw_data$track_name),]
colSums(is.na(clean_data))
##                track_name              track_artist          track_popularity 
##                         0                         0                         0 
##  track_album_release_year track_album_release_month   track_album_release_day 
##                         0                      1464                      1485 
##            playlist_genre         playlist_subgenre              danceability 
##                         0                         0                         0 
##                    energy                       key                  loudness 
##                         0                         0                         0 
##                      mode               speechiness              acousticness 
##                         0                         0                         0 
##          instrumentalness                  liveness                   valence 
##                         0                         0                         0 
##                     tempo               duration_ms 
##                         0                         0
  • The clean data now has no NA values in any column except release month and day.

Clean Data View

head(clean_data, 20) %>%
  datatable(options = list(scrollCollapse = TRUE,scrollX = TRUE,
  columnDefs = list(list(className = 'dt-center', targets = 1:4))
  ))

Summary

skim(clean_data) %>%
  datatable(options = list(scrollCollapse = TRUE,scrollX = TRUE,
  columnDefs = list(list(className = 'dt-center', targets = 1:4))
  ))

Exploratory Data Analysis

Overview

Correlation

spotify_correlation <- cor(clean_data[,-c(1,2,4,5,6,7,8,13)])
corrplot(spotify_correlation, 
         type="upper", 
         method="number", order="alphabet",
         addCoef.col="black", 
         diag=FALSE,
         tl.srt=45,
         tl.col="black",col = colorRampPalette(c("midnightblue", "white","black"))(100))

  • From this we can observe that there is a significant positive linear correlation between loudness and energy.
  • There is a significant negative linear correlation between acousticness and energy, and a negative correlation between acousticness and loudness.
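  • For reference, the exact coefficients behind these observations can be read straight from the correlation matrix computed above (a minimal sketch):
spotify_correlation["loudness", "energy"]
spotify_correlation["acousticness", "energy"]
spotify_correlation["acousticness", "loudness"]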

Genre and its sub-genre

theme_set(theme_bw()) 

genre_overview <- ggplot(clean_data, aes(x = factor(playlist_subgenre))) +
  geom_bar(width = 1, aes(fill = playlist_genre), colour = "black") +
theme(axis.text.x = element_text(angle =45, vjust = 20,hjust = 0))
genre_overview  + coord_polar()

  • The rap genre has the highest number of songs
  • The rock genre has the least number of songs

Genre Characteristics

ggplot(clean_data, aes(x=danceability, fill=playlist_genre))+ geom_density(alpha=0.4) +theme_foundation()+ labs(x="Danceability", y="Density") + guides(fill=guide_legend(title="Genres"))+ ggtitle("Danceability distribution per Genre")

  • From this density plot, we can observe that rap songs are the most danceable while rock songs are the least danceable.
ggplot(clean_data, aes(x=energy, fill=playlist_genre))+ geom_density(alpha=0.4) + scale_fill_brewer(palette="Set1") + theme_excel()+ labs(x="Energy", y="Density") + guides(fill=guide_legend(title="Genres"))+ ggtitle("Energy distribution per Genre")

  • This density plot shows that the rock and edm genres have high energy, while r&b has the least energy.
ggplot(clean_data, aes(x=loudness, fill=playlist_genre))+ geom_density(alpha=0.4) + scale_fill_brewer(palette="Set2") + labs(x="Loudness", y="Density") + guides(fill=guide_legend(title="Genres"))+ theme_economist()+ ggtitle("Loudness distribution per Genre")

  • From this density plot, we notice that the distributions are concentrated close to 0 dB, signifying loud music, with edm being the loudest genre.
ggplot(raw_data, aes(x=track_popularity, fill=playlist_genre))+ geom_density(alpha=0.2)+ theme_clean() + scale_fill_brewer(palette="Accent")+ labs(x="Track popularity", y="Density") + guides(fill=guide_legend(title="Genres"))+ ggtitle("Track popularity distribution per Genre")

  • This density plot shows that edm has high density near popularity 0, signifying that a large share of unpopular songs belong to this genre. In contrast, we can observe that latin songs are highly popular.

Popularity Characteristics

ggplot(data = clean_data) + 
      geom_point(mapping = aes(x = duration_ms, y = track_popularity, color = playlist_genre, alpha = 0.12))

  • The above scatter plot shows that songs with popularity above 80 have durations between 150000 ms and 300000 ms (roughly 2.5 to 5 minutes).
  • Additionally, there are no pop or rap songs in the dataset that are longer than 500000 ms.
theme_set(theme_bw())  

clean_data$track_popavg <- round((clean_data$track_popularity - mean(clean_data$track_popularity))/sd(clean_data$track_popularity), 2)

clean_data$tp_type <- ifelse(clean_data$track_popavg > 0, "above", "below")

clean_data <- clean_data[order(clean_data$track_popavg), ]

ggplot(clean_data, aes(x= playlist_subgenre, y=track_popavg, label=track_popavg)) + 
  geom_bar(stat='identity', aes(fill=tp_type), width=.5)  +
  scale_fill_manual(name="Popularity", 
                    labels = c("Above Average", "Below Average"), 
                    values = c("above"="purple", "below"="black")) + 
  labs(title= "Track Subgenre with respect to its popularity") + 
  coord_flip()

  • The majority of songs in the progressive electro house subgenre have below-average popularity.
  • For the urban contemporary, post-teen pop and hip-hop subgenres, more songs are above the popularity average than below it.
  • Subgenres like latin hip hop, classic rock and electropop have almost equal numbers of songs above and below the popularity average.
top_10_artists <- clean_data %>% group_by(track_artist, track_popularity) %>% summarise(count = n()) %>% arrange(desc(track_popularity), desc(count))
## `summarise()` has grouped output by 'track_artist'. You can override using the `.groups` argument.
top_10_artists <- top_10_artists %>% filter(track_popularity>=95) %>% select(track_artist, count)
    
df <- as.data.frame(top_10_artists)


colnames(df) <- c("track_artist", "count")
pie <- ggplot(df, aes(x = "", y=count, fill = track_artist)) + geom_bar(width = 1, stat = "identity") + theme(axis.line = element_blank(), 
        plot.title = element_text(hjust=0.5)) + labs(fill="track_artist", x=NULL, y=NULL, title="Pie Chart of track artists with track popularity of 95 or more")


pie + coord_polar(theta = "y", start=0)

  • Among tracks with popularity of 95 or more, Billie Eilish has the largest number of songs.
  • Among the other artists, the number of songs is almost the same.
par(mfrow=c(2,4))

pop_data <- clean_data %>% select(track_popularity, danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence) %>%  filter(track_popularity > 75) %>% arrange(track_popularity)

plot(data = pop_data, track_popularity ~ danceability, col = "blue")
plot(data = pop_data, track_popularity ~ energy, col = "blue")
plot(data = pop_data, track_popularity ~ loudness, col = "blue")
plot(data = pop_data, track_popularity ~ speechiness, col = "blue")
plot(data = pop_data, track_popularity ~ acousticness, col = "blue")
plot(data = pop_data, track_popularity ~ instrumentalness, col = "blue")
plot(data = pop_data, track_popularity ~ liveness, col = "blue")
plot(data = pop_data, track_popularity ~ valence, col = "blue")

  • For the majority of songs with popularity values greater than 75, the following is observed:

    • Slightly higher danceability
    • Higher loudness
    • low speechiness
    • low acousticness
    • low instrumentalness
    • low liveness

In-house Favourites

Shruthi’s Favourites

clean_data$genre_num <- as.numeric(factor(clean_data$playlist_genre))

shruthi_fav_artists = c("Ariana Grande", "Dua Lipa", "Taylor Swift", "The Weekend", "Jonas Brothers", "Justin Bieber", "The Chainsmokers", "Imagine Dragons")

shruthi_fav <- clean_data %>% select(track_artist, track_popularity, danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, genre_num, tempo, duration_ms,  playlist_genre) %>%  filter(trimws(track_artist) %in% shruthi_fav_artists)

shruthi_fav_columns <- names(shruthi_fav[c(-1,-14)])

shruthi_pivot <- shruthi_fav %>% 
  pivot_longer(cols = all_of(shruthi_fav_columns)) 

shruthi_pivot %>%
  ggplot(aes(x = name, y = value, color= playlist_genre, size = 10)) +
  #geom_density() +
  geom_jitter(cex = .4, aes(size = 10)) +
  facet_wrap(~name, ncol = 3, scales = 'free') +
  theme_classic() +
  theme(axis.text.x = element_blank()) +
  labs(title = 'Shruthi`s - likings are as follows', x = '', y = '')

  • From this plot, we can conclude that Shruthi’s music preferences have the following characteristics:
    • genre preference - pop > edm > latin > r&b > rap
    • low acousticness
    • slightly high energy
    • very low instrumentalness
    • low liveness
    • slightly high loudness
    • very low speechiness
    • no strong preference with respect to danceability, tempo, popularity and valence

Mounica’s Favourites

moun_fav_artists = c("Gloria Estefan", "Major Lazer", "Afrojack", "DJ Snake", "Elvis Presley", "Queen", "J Balvin", "Pink Floyd", "Daniel Powter")

moun_fav <- clean_data %>% select(track_artist, track_popularity, danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, genre_num, tempo, duration_ms,  playlist_genre) %>%  filter(trimws(track_artist) %in% moun_fav_artists)

moun_fav_columns <- names(moun_fav[c(-1,-14)])

moun_pivot <- moun_fav %>% 
  pivot_longer(cols = all_of(moun_fav_columns)) 

moun_pivot %>%
  ggplot(aes(x = name, y = value, color= playlist_genre, size = 0.5)) +
  #geom_density() +
  geom_jitter(cex = .4, aes(size = 0.5)) +
  facet_wrap(~name, ncol = 3, scales = 'free') +
  theme_classic() +
  theme(axis.text.x = element_blank()) +
  labs(title = 'Mounica`s - likings are as follows', x = '', y = '')

  • From this plot, we can conclude that Mounica’s music preferences have the following characteristics:
    • genre preference - pop > edm > latin > r&b > rap
    • low acousticness
    • slightly high energy
    • very low instrumentalness
    • low liveness
    • slightly high loudness
    • very low speechiness
    • no strong preference with respect to danceability, tempo, popularity and valence

Conclusion:

  • It can be concluded that Mounica and Shruthi have similar taste in music.

Data Modeling

KNN Classification

KNN is a classification technique that uses the similarities among existing observations in various groups to predict the classification of new data.

  • Here, it is decided to classify songs into popularity groups based solely on their numerical audio features; other attributes such as genre, language and artist are not considered.
quantile(clean_data$track_popularity, c(0.33,0.66))
## 33% 66% 
##  31  52
model_data <- clean_data %>%
  mutate(popularity_gp = case_when(
     track_popularity >= 0 & track_popularity <= 31 ~ "Least_Popularity",
                  track_popularity >= 32 & track_popularity <= 52 ~ "Average_Popularity",
                  TRUE ~ "Highest_Popularity"
  )) %>%
  select(where(is.numeric), -c(track_popavg, genre_num, track_popularity,duration_ms), popularity_gp)

model_data$popularity_gp = as.factor(model_data$popularity_gp)
str(model_data)
## 'data.frame':    26229 obs. of  11 variables:
##  $ danceability    : num  0.605 0.716 0.705 0.559 0.498 0.684 0.704 0.555 0.294 0.43 ...
##  $ energy          : num  0.894 0.981 0.975 0.844 0.723 0.85 0.482 0.712 0.929 0.944 ...
##  $ key             : int  9 0 6 11 7 5 5 2 9 6 ...
##  $ loudness        : num  -3.88 -3.61 -3.85 -6.4 -6.74 ...
##  $ speechiness     : num  0.0385 0.031 0.0312 0.0557 0.223 0.109 0.0475 0.0357 0.0425 0.16 ...
##  $ acousticness    : num  1.19e-02 3.17e-02 3.32e-03 1.86e-04 5.08e-01 1.92e-02 3.02e-02 5.78e-02 1.00e-04 2.66e-05 ...
##  $ instrumentalness: num  0.00 7.40e-03 1.82e-04 9.94e-04 5.21e-04 1.37e-04 9.13e-05 4.50e-03 0.00 1.96e-01 ...
##  $ liveness        : num  0.0887 0.171 0.537 0.169 0.419 0.589 0.173 0.128 0.0773 0.235 ...
##  $ valence         : num  0.545 0.861 0.969 0.357 0.318 0.925 0.284 0.288 0.26 0.55 ...
##  $ tempo           : num  122 139 130 128 110 ...
##  $ popularity_gp   : Factor w/ 3 levels "Average_Popularity",..: 3 3 3 3 3 3 3 3 3 3 ...
  • Here, grouping is based on the 33rd and 66th percentiles of track popularity, and the classification is as follows:
    • Popularity value between 0 and 31 : Least Popularity (up to the 33rd percentile)
    • Popularity value between 32 and 52 : Average Popularity (33rd to 66th percentile)
    • Popularity value of 53 or more : Highest Popularity
  • KNN classification of songs is done based on this grouping.
table(model_data$popularity_gp)
## 
## Average_Popularity Highest_Popularity   Least_Popularity 
##               8695               8717               8817
  • It is noted that the groups have approximately equal proportions of data.
set.seed(3245)
gp <- runif(nrow(model_data))
model_data <- model_data[order(gp),]
head(model_data,5)
##       danceability energy key loudness speechiness acousticness
## 16056        0.480  0.974   5   -5.531      0.0947      0.00754
## 10783        0.706  0.552   6  -10.447      0.1690      0.30500
## 23342        0.811  0.458   5   -9.688      0.0402      0.01930
## 7231         0.725  0.600   1   -7.852      0.4230      0.01070
## 23500        0.486  0.620   4   -7.115      0.0284      0.05170
##       instrumentalness liveness valence   tempo      popularity_gp
## 16056         5.64e-03   0.8950   0.522  96.382   Least_Popularity
## 10783         6.06e-06   0.0968   0.361 165.030 Average_Popularity
## 23342         1.41e-06   0.0526   0.797  92.940 Highest_Popularity
## 7231          0.00e+00   0.3500   0.454 106.223 Average_Popularity
## 23500         8.17e-03   0.0888   0.417 141.147 Average_Popularity
  • The rows are shuffled here to avoid any accidental ordering by the popularity_gp column, since such an ordering could produce biased training and test datasets.
summary(model_data[,-11])
##   danceability        energy              key            loudness      
##  Min.   :0.0000   Min.   :0.000175   Min.   : 0.000   Min.   :-46.448  
##  1st Qu.:0.5620   1st Qu.:0.577000   1st Qu.: 2.000   1st Qu.: -8.337  
##  Median :0.6700   Median :0.720000   Median : 6.000   Median : -6.274  
##  Mean   :0.6536   Mean   :0.696510   Mean   : 5.376   Mean   : -6.838  
##  3rd Qu.:0.7600   3rd Qu.:0.842000   3rd Qu.: 9.000   3rd Qu.: -4.722  
##  Max.   :0.9830   Max.   :1.000000   Max.   :11.000   Max.   :  1.275  
##   speechiness      acousticness    instrumentalness       liveness     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000000   Min.   :0.0000  
##  1st Qu.:0.0410   1st Qu.:0.0143   1st Qu.:0.0000000   1st Qu.:0.0929  
##  Median :0.0630   Median :0.0815   Median :0.0000221   Median :0.1270  
##  Mean   :0.1087   Mean   :0.1808   Mean   :0.0951242   Mean   :0.1911  
##  3rd Qu.:0.1350   3rd Qu.:0.2670   3rd Qu.:0.0075000   3rd Qu.:0.2480  
##  Max.   :0.9180   Max.   :0.9940   Max.   :0.9940000   Max.   :0.9960  
##     valence           tempo       
##  Min.   :0.0000   Min.   :  0.00  
##  1st Qu.:0.3250   1st Qu.: 99.98  
##  Median :0.5070   Median :122.01  
##  Mean   :0.5064   Mean   :121.00  
##  3rd Qu.:0.6900   3rd Qu.:134.03  
##  Max.   :0.9910   Max.   :239.44
  • Above are the numerical parameters taken as input for KNN classification. Because KNN is distance-based, it is very important to rescale or normalize these parameter values before modeling.
normalize <- function(x) {
  return((x - min(x))/(max(x) - min(x)))
}

model_norm <- model_data
model_norm$popularity_gp <- NULL
model_norm <- as.data.frame(lapply(model_norm,normalize))
summary(model_norm)
##   danceability        energy            key            loudness     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.5717   1st Qu.:0.5769   1st Qu.:0.1818   1st Qu.:0.7986  
##  Median :0.6816   Median :0.7200   Median :0.5455   Median :0.8418  
##  Mean   :0.6650   Mean   :0.6965   Mean   :0.4888   Mean   :0.8300  
##  3rd Qu.:0.7731   3rd Qu.:0.8420   3rd Qu.:0.8182   3rd Qu.:0.8743  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##   speechiness       acousticness     instrumentalness       liveness      
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.0000000   Min.   :0.00000  
##  1st Qu.:0.04466   1st Qu.:0.01439   1st Qu.:0.0000000   1st Qu.:0.09327  
##  Median :0.06863   Median :0.08199   Median :0.0000222   Median :0.12751  
##  Mean   :0.11845   Mean   :0.18194   Mean   :0.0956984   Mean   :0.19189  
##  3rd Qu.:0.14706   3rd Qu.:0.26861   3rd Qu.:0.0075453   3rd Qu.:0.24900  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.0000000   Max.   :1.00000  
##     valence           tempo       
##  Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.3280   1st Qu.:0.4175  
##  Median :0.5116   Median :0.5096  
##  Mean   :0.5110   Mean   :0.5054  
##  3rd Qu.:0.6963   3rd Qu.:0.5598  
##  Max.   :1.0000   Max.   :1.0000
  • All the parameters are normalized and now have values ranging from 0 to 1.
set.seed(123)
train_idx <- sample(nrow(model_norm), .80*nrow(model_norm))

model_train <- model_norm[train_idx,]
model_test <- model_norm[-train_idx,]
model_train_target <- model_data[train_idx,11]
model_test_target <- model_data[-train_idx,11]

sqrt(nrow(model_data))
## [1] 161.9537
  • Here, the data is split into training (80%) and testing (20%) sets. The corresponding popularity_gp column is taken as the target.

  • Also, the k value considered here is 161, following the common rule of thumb of taking approximately the square root of the number of rows in the data.

m1 <- knn(train = model_train, test = model_test, cl= model_train_target, k = 161)
levels(m1)
## [1] "Average_Popularity" "Highest_Popularity" "Least_Popularity"
  • The KNN model is run, and the m1 variable holds the predicted classification for every row in the test dataframe.
cm = as.matrix(table(model_test_target, m1))
cm
##                     m1
## model_test_target    Average_Popularity Highest_Popularity Least_Popularity
##   Average_Popularity                425                870              447
##   Highest_Popularity                367               1041              277
##   Least_Popularity                  415                810              594
  • A confusion matrix is created to compare the actual and predicted classifications.
  • As stated on Wikipedia, a confusion matrix is "a special kind of contingency table, with two dimensions ('actual' and 'predicted'), and identical sets of 'classes' in both dimensions (each combination of dimension and class is a variable in the contingency table)."
sum(diag(cm))/length(model_test_target)
## [1] 0.3926801
  • Model accuracy comes out to be just about 39%.
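  • To check that the low accuracy is not simply an artefact of this particular k, accuracy can be swept over a few alternative k values (a sketch using the training/test objects defined above; the k values are illustrative):
for (k in c(21, 61, 101, 161, 201)) {
  pred <- knn(train = model_train, test = model_test, cl = model_train_target, k = k)
  cat("k =", k, "accuracy =", round(mean(pred == model_test_target), 3), "\n")
}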

Conclusion

  • Thus, the popularity of a song depends not only on the numerical parameters above but also, and more strongly, on the artist, genre and listener demographics. It cannot be easily predicted solely from these numerical parameters.

Time Series on Rock Music

Will Rock Music Dominate the Industry in the 2020s

Rock music had its share of decline in popularity in the 2010s, and the Spotify data shows a similar trend. Here is a time series analysis of rock music’s popularity over the years, along with a forecast for the next five years based on the songs present in this dataset.

genre_year = clean_data %>%
  filter(track_album_release_year > '1985') %>% 
  group_by(playlist_genre, track_album_release_year) %>% 
  summarise(avg_popularity = c(mean(track_popularity)), .groups = 'drop')

genre_year$track_album_release_year <- as.numeric(as.character(genre_year$track_album_release_year))

ggplot(genre_year, aes(x = track_album_release_year, y = avg_popularity)) + 
  geom_line(aes(color = playlist_genre), size = 1) +
  theme_minimal()

  • Above is a time series plot of average song popularity grouped by genre from 1985 to 2020.
rock_popularity <- genre_year %>% 
  filter(playlist_genre == 'rock' & track_album_release_year > '1975') %>% 
  select(track_album_release_year, avg_popularity)

ts_data <- rock_popularity$avg_popularity
plot.ts(ts_data)

  • The plot seems to have non-constant variance and non-constant mean, so Box-Cox is used to stabilize the variance. However, there is no trace of seasonality or trend in the plot.
bc <- BoxCox.ar(ts_data)

bc$lambda[which.max(bc$loglike)]
## [1] 0.7
  • The estimated lambda value is 0.7. Thus both log and square-root transformations are tried and checked for the better fit.
par(mfrow=c(1,2))
plot.ts(sqrt(ts_data))
plot.ts(log(ts_data))

  • The variance of the log-transformed data is more stable than that of the square-root transformation. Thus, the log transformation of the time series is chosen.
par(mfrow=c(1,2))
acf(log(ts_data))
pacf(log(ts_data))

  • The ACF and PACF plots are checked to relate the series to a theoretical time series model. Here, the ACF cuts off at lag 5 and the PACF cuts off at lag 2. The EACF is examined next to get a clearer idea of the model.
eacf(log(ts_data), ar.max = 3)
## AR/MA
##   0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 x x x x o o o o o o o  o  o  o 
## 1 x o o o o o o o o o o  o  o  o 
## 2 o o o o o o o o o o o  o  o  o 
## 3 x o o o o o o o o o o  o  o  o
  • Here, the EACF shows no remaining autocorrelation from ARMA(1,1) onwards. Stationarity can now be checked using the Augmented Dickey-Fuller test.
adf.test(log(ts_data))
## 
##  Augmented Dickey-Fuller Test
## 
## data:  log(ts_data)
## Dickey-Fuller = -1.8444, Lag order = 3, p-value = 0.6335
## alternative hypothesis: stationary
  • Augmented Dickey-Fuller Test
    • Null hypothesis : The series is non stationary
    • Alternative hypothesis : The series is stationary
  • The p-value (0.6335) is greater than 0.05, which shows that the series is non-stationary. Thus, the first difference of the series is taken (via diff()) to make it stationary.
plot.ts(diff(log(ts_data)))

  • The transformed plot appears to be stationary in variance and in mean.
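  • This can be confirmed by re-running the Augmented Dickey-Fuller test on the differenced series (a sketch; output not reproduced here):
adf.test(diff(log(ts_data)))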
par(mfrow=c(1,2))
acf(diff(log(ts_data)))
pacf(diff(log(ts_data)))

  • The ACF and PACF plots both cut off at lag 1. The EACF is considered further to finalise a model for the transformed data.
eacf(diff(log(ts_data)), ar.max = 3)
## AR/MA
##   0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 x o o o o o o o o o o  o  o  o 
## 1 o o o o o o o o o o o  o  o  o 
## 2 x o o o o o o o o o o  o  o  o 
## 3 x o o o o o o o o o o  o  o  o
  • The EACF suggests ARIMA(0,1,1), ARIMA(1,1,0) and ARIMA(1,1,1) as candidate models. The AIC values of all three models are compared to choose the better fit.
fit1= arima(ts(log(ts_data)), order = c(1,1,1))
coeftest(fit1)
## 
## z test of coefficients:
## 
##     Estimate Std. Error z value Pr(>|z|)
## ar1 -0.29502    0.27348 -1.0788   0.2807
## ma1 -0.42549    0.27312 -1.5579   0.1193
fit1
## 
## Call:
## arima(x = ts(log(ts_data)), order = c(1, 1, 1))
## 
## Coefficients:
##           ar1      ma1
##       -0.2950  -0.4255
## s.e.   0.2735   0.2731
## 
## sigma^2 estimated as 0.02321:  log likelihood = 15.47,  aic = -26.94
fit2= arima(ts(log(ts_data)), order = c(0,1,1))
coeftest(fit2)
## 
## z test of coefficients:
## 
##     Estimate Std. Error z value  Pr(>|z|)    
## ma1 -0.63340    0.12801  -4.948 7.498e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fit2
## 
## Call:
## arima(x = ts(log(ts_data)), order = c(0, 1, 1))
## 
## Coefficients:
##           ma1
##       -0.6334
## s.e.   0.1280
## 
## sigma^2 estimated as 0.02395:  log likelihood = 14.94,  aic = -27.88
fit3= arima(ts(log(ts_data)), order = c(1,1,0))
coeftest(fit3)
## 
## z test of coefficients:
## 
##     Estimate Std. Error z value  Pr(>|z|)    
## ar1 -0.57779    0.13766 -4.1972 2.702e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fit3
## 
## Call:
## arima(x = ts(log(ts_data)), order = c(1, 1, 0))
## 
## Coefficients:
##           ar1
##       -0.5778
## s.e.   0.1377
## 
## sigma^2 estimated as 0.02436:  log likelihood = 14.71,  aic = -27.42
  • Among the models ARIMA(1,1,1), ARIMA(0,1,1) and ARIMA(1,1,0), ARIMA(0,1,1) has the lowest AIC value of -27.88. Thus, ARIMA(0,1,1) is taken forward for residual diagnostics.
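  • The same AIC comparison can also be pulled together in a single call instead of reading it off the three printouts (a minimal sketch):
AIC(fit1, fit2, fit3)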
checkresiduals(fit2)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1)
## Q* = 5.6504, df = 6, p-value = 0.4635
## 
## Model df: 1.   Total lags used: 7
  • It is seen that the residual plot has constant mean and variance, and the ACF of the residuals resembles white noise.

  • Ljung-Box test

  • Null hypothesis: There is no auto-correlation between residuals.

  • Alternative hypothesis: There is auto-correlation between residuals.

  • Also, the p-value (0.4635) is greater than 0.05, so the null hypothesis cannot be rejected here. This is consistent with the ACF, which shows that the residuals behave like white noise and have no correlation.

The ARIMA(0,1,1) model is therefore selected for this time series.

fit= Arima(ts(log(ts_data)), order = c(0,1,1))
plot(forecast(fit, h = 5))

  • The above plot shows the forecast of rock music popularity for the next five years with respect to the Spotify data.
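  • Since the model was fit on the log of average popularity, the point forecasts can be mapped back to the original popularity scale for easier reading (a sketch; this simple back-transform ignores the log-normal bias adjustment):
exp(forecast(fit, h = 5)$mean)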

K Means Clustering

K-means is an unsupervised clustering algorithm that takes numeric inputs and groups the data into clusters of similar observations, which can then be used to characterize or predict values for the grouped data. An elbow-point graph is used to determine the number of clusters to use in the kmeans function.

clean_data_kmeans <- clean_data %>% select_if(is.numeric)

clean_data_kmeans <- clean_data_kmeans %>% select(-c(track_popavg,genre_num))
clean_data_kmeans_scaled <- scale(clean_data_kmeans)

elbow_point <- function(data, maxCluster = 9) {
  # Within-group sum of squares for k = 1 is the total variance of the data
  SSw <- (nrow(data) - 1) * sum(apply(data, 2, var))
  for (i in 2:maxCluster) {
    SSw[i] <- sum(kmeans(data, centers = i)$withinss)
  }
  plot(1:maxCluster, SSw, type = "o", xlab = "Number of Clusters", ylab = "Within groups sum of squares", pch=19)
}

elbow_point(clean_data_kmeans_scaled)
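  • The factoextra package loaded earlier provides an equivalent elbow plot out of the box, which can be used to cross-check the hand-rolled function above (a sketch; it may be slow on the full scaled data):
fviz_nbclust(clean_data_kmeans_scaled, kmeans, method = "wss", k.max = 9)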

model1 <- kmeans(clean_data_kmeans_scaled, 6)
model1$size
## [1] 3283 7056 2203 3430 5709 4548
model1$centers
##   track_popularity danceability       energy          key    loudness
## 1       0.06460638   0.43113893 -0.182839316  0.056245862 -0.06263231
## 2       0.74533319   0.51651937  0.184624152 -0.043263129  0.37248101
## 3      -0.46163585   0.07279951  0.481009448 -0.002018671 -0.06099165
## 4       0.18290604  -0.43989632 -1.573390857 -0.052654929 -1.28671078
## 5      -0.08956485  -0.92656470  0.639581342 -0.025514041  0.53360344
## 6      -1.00488830   0.34701741 -0.003682848  0.099235415 -0.20254257
##   speechiness acousticness instrumentalness    liveness    valence       tempo
## 1   2.1231579   0.05544728       -0.3473924  0.09815776  0.1636627  0.02550604
## 2  -0.2725070  -0.12675818       -0.3459895 -0.19867045  0.4918050 -0.15298820
## 3  -0.3630038  -0.49831433        2.7630316 -0.03172469 -0.5123834  0.15399665
## 4  -0.3544612   1.71303417        0.1175860 -0.23430906 -0.5564637 -0.29857333
## 5  -0.2966919  -0.52378768       -0.2788568  0.51181511 -0.5695988  0.48773471
## 6  -0.2942403  -0.23642141       -0.2894667 -0.21302026  0.5017172 -0.24271706
##   duration_ms
## 1 -0.18451500
## 2 -0.32192833
## 3  0.42616101
## 4 -0.10319469
## 5 -0.01669151
## 6  0.52500131
clean_data$cluster <- model1$cluster
  • The cluster centers give the mean of each (scaled) variable within each cluster.
  • The size component denotes how many data points have been grouped into each cluster.
fviz_cluster(model1, data=clean_data_kmeans_scaled)

clean_data %>% 
  group_by(cluster) %>% 
  summarise_all(mean) %>% 
  select(cluster, acousticness, danceability, energy, instrumentalness, speechiness, valence, liveness)
## # A tibble: 6 x 8
##   cluster acousticness danceability energy instrumentalness speechiness valence
##     <int>        <dbl>        <dbl>  <dbl>            <dbl>       <dbl>   <dbl>
## 1       1       0.193         0.716  0.663           0.0126      0.328    0.545
## 2       2       0.152         0.729  0.731           0.0129      0.0806   0.622
## 3       3       0.0681        0.664  0.785           0.752       0.0712   0.386
## 4       4       0.568         0.590  0.406           0.123       0.0721   0.376
## 5       5       0.0623        0.519  0.815           0.0289      0.0781   0.373
## 6       6       0.127         0.704  0.696           0.0263      0.0783   0.624
## # ... with 1 more variable: liveness <dbl>
  • Next, we find the mean of each cluster from the original data for the numeric features.

Song Recommendation

  • One of our favourite songs is “The Nights” by Avicii. By running the code below, we identify that this song belongs to cluster 2.
clean_data %>% 
  filter(track_name == "The Nights", track_artist == "Avicii") %>%
  datatable(options = list(scrollCollapse = TRUE,scrollX = TRUE,
  columnDefs = list(list(className = 'dt-center', targets = 1:4))
  ))
  • From cluster 2, restricting to the pop genre (dance pop subgenre) and a popularity within +/- 5 of the Avicii song, 5 tracks are chosen at random. These 5 recommendations are the predicted likable songs.
set.seed(123)
clean_data %>% 
  filter(cluster == 2, playlist_genre == "pop", playlist_subgenre=="dance pop", track_popularity >=75 & track_popularity <= 85) %>%
  sample_n(5) %>%
  datatable(options = list(scrollCollapse = TRUE,scrollX = TRUE,
  columnDefs = list(list(className = 'dt-center', targets = 1:4))
  ))
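  • The same recommendation logic can be wrapped in a small helper so that suggestions can be generated for any seed song. This is only a sketch built on the objects above; the function name and the +/- 5 popularity window are our own choices, and it filters on genre rather than subgenre:
recommend_similar <- function(seed_name, seed_artist, n_songs = 5) {
  # Look up the seed song's cluster, genre and popularity
  seed <- clean_data %>%
    filter(track_name == seed_name, track_artist == seed_artist)
  # Sample tracks from the same cluster and genre with popularity within +/- 5
  clean_data %>%
    filter(cluster == seed$cluster[1],
           playlist_genre == seed$playlist_genre[1],
           track_popularity >= seed$track_popularity[1] - 5,
           track_popularity <= seed$track_popularity[1] + 5,
           track_name != seed_name) %>%
    sample_n(n_songs) %>%
    select(track_name, track_artist, track_popularity, playlist_subgenre)
}

recommend_similar("The Nights", "Avicii")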

Conclusions

Proposal :

We implemented the KNN algorithm to classify songs into popularity groups; however, it is concluded that a song’s popularity cannot be determined or derived from numerical features like energy, danceability, etc. alone, as it also depends on the artist, genre and listener demographics.

We have also implemented K-means to cluster the songs based on the similarity of their numeric features, and predicted songs that match the profile of a favourite song by filtering within the same cluster.

Also, Spotify recently acquired SoundBetter, a music production marketplace, and is venturing into producing music of its own. Such clustering analysis can help it understand which parameters future productions should concentrate on, which will in turn help widen its customer base.

Limitations

There are about 70 million tracks on Spotify to date, while our cleaned data has only 26,229 tracks. Using such a small sample for prediction comes with drawbacks, not to mention the lack of other variables that capture the demography, language, culture and listening history of customers.