Introduction

Streaming services have made listening to music effortless. They provide instant, limitless access to music from all over the world and across genres. They have also changed the way artists share their music: streaming platforms ease the process of recording, releasing and marketing music. In return, artists are able to increase engagement with their fans and release more music than before.

Spotify has quickly become the leading music streaming app with 250 million monthly active users, 50 million tracks and more than 3 million artists. One reason Spotify stands out is that it uses algorithms and machine learning to study the listening habits and music preferences of its users, which increases user engagement and revenue. This makes it a good platform for artists to showcase their music and earn revenue.

We wanted to analyze the Spotify data set from an artist's point of view with the following questions:

1. Who is the most popular artist on Spotify?
2. What is the most popular track on Spotify?
3. What is the most popular genre on Spotify?
4. Is there a correlation between the different track attributes (ex: duration, danceability, energy, etc.)?
5. Is there a correlation between genre and track attributes?
6. Who are the top artists for each genre?
7. What factors affect popularity on Spotify?

Our methodology is to first clean the data set and check it for completeness, and then to explore the relationships between different variables using various functions in R.

With our analysis, we hope to help artists create music that will increase their popularity and user engagement on Spotify.

Required Packages

We plan to use the following packages for our analysis of the data set:

  • dplyr - used to manipulate data
  • tidyverse - used to define data structures
  • DT - used to summarize results in data tables
  • MASS - used to plot true histograms of track attributes
  • ggplot2 - will be used to visualize relationships between variables
  • corrplot - will be used to analyze correlation between various variables

library(dplyr) # for data manipulation
library(MASS) # for plotting true histograms
library(DT) # for showing results in a datatable

Data Preparation

Data Source and Data Import

Data Source

The data used for this analysis comes from Spotify’s API and was collected using the spotifyr package. The complete data set can be found here: Spotify Data Set

The data set includes data from 1960 to 2020. It has 32833 observations and 23 variables. The following table shows each variable’s name, data type and definition.

Data Import

The following code was used to import the data set and to create a sample view of the data before any data cleaning was performed.

# Import the data set 
spotify_raw <- read.csv("D:/BANA/Second_Session/Data Wrangling/Spotify_Project/spotify_songs.csv",stringsAsFactors = FALSE)
attach(spotify_raw)

datatable(
  head(spotify_raw,100),
  extensions = 'FixedColumns',
  options = list(
    scrollY = "400px",
    scrollX = TRUE,
    fixedColumns = TRUE
  )
)

Data Cleaning

Data Structure and Summary

The following shows the structure of the data set. This allows us to see the variable name and data type of each variable. All the variable names and data types look appropriate for the values, so no modifications were made to change the name or the data type.

str(spotify_raw)
## 'data.frame':    32833 obs. of  23 variables:
##  $ track_id                : chr  "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
##  $ track_name              : chr  "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr  "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : int  66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : chr  "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
##  $ track_album_name        : chr  "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr  "6/14/2019" "12/13/2019" "7/5/2019" "7/19/2019" ...
##  $ playlist_name           : chr  "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_id             : chr  "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
##  $ playlist_genre          : chr  "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr  "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : int  6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : int  1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num  122 100 124 122 124 ...
##  $ duration_ms             : int  194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...

The following is a summary of the data set. The summary helps us spot anomalies, such as negative or otherwise abnormal values, that need to be examined further.

summary(spotify_raw)
##    track_id          track_name        track_artist       track_popularity
##  Length:32833       Length:32833       Length:32833       Min.   :  0.00  
##  Class :character   Class :character   Class :character   1st Qu.: 24.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 45.00  
##                                                           Mean   : 42.48  
##                                                           3rd Qu.: 62.00  
##                                                           Max.   :100.00  
##  track_album_id     track_album_name   track_album_release_date
##  Length:32833       Length:32833       Length:32833            
##  Class :character   Class :character   Class :character        
##  Mode  :character   Mode  :character   Mode  :character        
##                                                                
##                                                                
##                                                                
##  playlist_name      playlist_id        playlist_genre     playlist_subgenre 
##  Length:32833       Length:32833       Length:32833       Length:32833      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   danceability        energy              key            loudness      
##  Min.   :0.0000   Min.   :0.000175   Min.   : 0.000   Min.   :-46.448  
##  1st Qu.:0.5630   1st Qu.:0.581000   1st Qu.: 2.000   1st Qu.: -8.171  
##  Median :0.6720   Median :0.721000   Median : 6.000   Median : -6.166  
##  Mean   :0.6548   Mean   :0.698619   Mean   : 5.374   Mean   : -6.720  
##  3rd Qu.:0.7610   3rd Qu.:0.840000   3rd Qu.: 9.000   3rd Qu.: -4.645  
##  Max.   :0.9830   Max.   :1.000000   Max.   :11.000   Max.   :  1.275  
##       mode         speechiness      acousticness    instrumentalness   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000000  
##  1st Qu.:0.0000   1st Qu.:0.0410   1st Qu.:0.0151   1st Qu.:0.0000000  
##  Median :1.0000   Median :0.0625   Median :0.0804   Median :0.0000161  
##  Mean   :0.5657   Mean   :0.1071   Mean   :0.1753   Mean   :0.0847472  
##  3rd Qu.:1.0000   3rd Qu.:0.1320   3rd Qu.:0.2550   3rd Qu.:0.0048300  
##  Max.   :1.0000   Max.   :0.9180   Max.   :0.9940   Max.   :0.9940000  
##     liveness         valence           tempo         duration_ms    
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Min.   :  4000  
##  1st Qu.:0.0927   1st Qu.:0.3310   1st Qu.: 99.96   1st Qu.:187819  
##  Median :0.1270   Median :0.5120   Median :121.98   Median :216000  
##  Mean   :0.1902   Mean   :0.5106   Mean   :120.88   Mean   :225800  
##  3rd Qu.:0.2480   3rd Qu.:0.6930   3rd Qu.:133.92   3rd Qu.:253585  
##  Max.   :0.9960   Max.   :0.9910   Max.   :239.44   Max.   :517810

Missing Values

There are 15 missing values in the data set overall:

sum(is.na(spotify_raw))
## [1] 15

Below is the number of missing values for each variable:

  • track_name - 5
  • track_artist - 5
  • track_album_name - 5
colSums(is.na(spotify_raw))
##                 track_id               track_name             track_artist 
##                        0                        5                        5 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        5 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0
# Row indices of the missing values in each affected column
track_name_index <- which(is.na(spotify_raw$track_name))
track_name_index
## [1]  8152  9283  9284 19569 19812
track_artist_index <- which(is.na(spotify_raw$track_artist))
track_artist_index
## [1]  8152  9283  9284 19569 19812
track_album_name_index <- which(is.na(spotify_raw$track_album_name))
track_album_name_index
## [1]  8152  9283  9284 19569 19812

This means that only 5 of the 32833 observations (about 0.015%) contain missing values, and each of these rows is missing all three of track_name, track_artist and track_album_name. Since the percentage of affected observations is very low, we decided to remove them.

spotify_clean <- na.omit(spotify_raw)
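
As a quick sanity check on this step, the number of rows dropped by na.omit() should equal the number of incomplete rows. The following is a minimal sketch on a small made-up data frame (the `toy` object and its values are illustrative assumptions, not rows from the Spotify data set):

```r
# Toy stand-in for spotify_raw; the values are made up for illustration
toy <- data.frame(
  track_id     = c("a1", "b2", "c3", "d4"),
  track_name   = c("Song A", NA, "Song C", "Song D"),
  track_artist = c("Artist 1", NA, "Artist 2", "Artist 3"),
  stringsAsFactors = FALSE
)

# Count rows with at least one NA (once per row, not once per cell)
rows_with_na <- sum(!complete.cases(toy))
pct_missing  <- 100 * rows_with_na / nrow(toy)

# Drop the incomplete rows, as done for spotify_raw above
toy_clean <- na.omit(toy)

# The number of rows dropped should match the number of incomplete rows
rows_dropped <- nrow(toy) - nrow(toy_clean)
rows_dropped == rows_with_na
```

Applied to the real data, the same comparison would use spotify_raw and spotify_clean in place of the toy objects.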

Duplicate Values

There are no duplicate observations in the data set.

duplicate_values <- duplicated(spotify_raw)
str(duplicate_values)
##  logi [1:32833] FALSE FALSE FALSE FALSE FALSE FALSE ...
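
Because str() only prints the first few elements of the logical vector, a count is a more direct way to confirm that no row is fully duplicated. A minimal sketch on made-up data (the `toy` data frame is an assumption for illustration):

```r
# Toy data frame in which row 3 is an exact copy of row 1
toy <- data.frame(
  track_id       = c("a1", "b2", "a1"),
  playlist_genre = c("pop", "rock", "pop"),
  stringsAsFactors = FALSE
)

# Total number of fully duplicated rows (0 would mean no duplicates)
n_duplicate_rows <- sum(duplicated(toy))

# Index of the first duplicated row, or 0 if there is none
first_duplicate <- anyDuplicated(toy)

n_duplicate_rows
first_duplicate
```

On the real data, sum(duplicated(spotify_raw)) returning 0 would support the statement above.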

However, there are duplicate track_id values in the data set. These tracks appear under multiple playlist_genre values, which means that they are not true duplicates. Hence, these values were not removed.

duplicate_values <- duplicated(track_id)
summary(duplicate_values)
##    Mode   FALSE    TRUE 
## logical   28356    4477
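
One way to check that repeated track_id values really do span several genres is to count the distinct playlist_genre values per track. The sketch below uses made-up data (the `toy` data frame is an assumption); note that a track_id can also repeat within a single genre, for example across playlists, which this check would reveal as a count of 1.

```r
# Toy data: "a1" appears under two genres, "c3" twice under the same genre
toy <- data.frame(
  track_id       = c("a1", "a1", "b2", "c3", "c3"),
  playlist_genre = c("pop", "edm", "rock", "rap", "rap"),
  stringsAsFactors = FALSE
)

# Number of distinct genres each track_id appears under
genres_per_track <- tapply(
  toy$playlist_genre,
  toy$track_id,
  function(g) length(unique(g))
)

# track_id values listed under more than one genre
multi_genre_ids <- names(genres_per_track[genres_per_track > 1])
multi_genre_ids
```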

Abnormal Values

Some of the values in the data set were found to have unknown characters such as the following: ?, @, $, etc.

Values with these characters were not removed, as we found them to be either characters from another language that couldn’t be translated to English or part of the actual value.

# List the distinct values so unusual characters are easier to spot
unique_artist <- unique(spotify_raw$track_artist)
unique_artist

unique_track_album_name <- unique(spotify_raw$track_album_name)
unique_track_album_name

Histograms were used to check for the distribution of the data:

  • The following variables appear to have right skewed data: speechiness and acousticness.
  • The following variables appear to have slightly left skewed data: track_popularity, energy, tempo and loudness
  • The danceability variable has approximately a normal distribution.
par(mfrow = c(2,4), oma = c(1,1,0,0) + 0.1, mar = c(3,3,1,1) + 0.1)

truehist(track_popularity, h = 10, col = "steelblue")
mtext("track_popularity", side = 1, outer = F, line = 2, cex = 0.8)

truehist(danceability, h = 0.1, col = "steelblue")
mtext( "danceability", side = 1, outer = F, line = 2, cex = 0.8)

truehist(energy, h = 0.1, col = "steelblue")
mtext("energy", side = 1, outer = F, line = 2, cex = 0.8)

truehist(loudness, h = 10, col = "steelblue")
mtext("loudness", side = 1, outer = F, line = 2, cex = 0.8)

truehist(speechiness, h = 0.1, col = "steelblue")
mtext("speechiness", side = 1, outer = F, line = 2, cex = 0.8)

truehist(acousticness, h = 0.1, col = "steelblue")
mtext("acousticness", side = 1, outer = F, line = 2, cex = 0.8)

truehist(tempo, h = 10, col = "steelblue") # tempo spans roughly 0 to 240, so a bin width of 10 is used
mtext("tempo", side = 1, outer = F, line = 2, cex = 0.8)
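
The skew directions read off the histograms can be cross-checked against the summary() output by comparing each mean with its median: a mean above the median suggests right skew, and a mean below it suggests left skew. This is only a rough heuristic, not a formal test. The sketch below hard-codes the means and medians printed earlier by summary(spotify_raw); the `summary_stats` data frame is our own construction for illustration.

```r
# Means and medians copied from the summary(spotify_raw) output
summary_stats <- data.frame(
  variable = c("speechiness", "acousticness", "track_popularity",
               "energy", "tempo", "loudness"),
  mean     = c(0.1071, 0.1753, 42.48, 0.698619, 120.88, -6.720),
  median   = c(0.0625, 0.0804, 45.00, 0.721000, 121.98, -6.166),
  stringsAsFactors = FALSE
)

# Heuristic: mean > median suggests right skew, otherwise left skew
summary_stats$direction <- ifelse(
  summary_stats$mean > summary_stats$median, "right", "left"
)

summary_stats
```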

Boxplots were used to check each numeric variable for outliers. There are very few outliers, and they fall close to the range of the rest of the data. Hence, they are not considered extreme values and have not been removed.

par(mfrow = c(2,6), oma = c(1,1,0,0) + 0.1, mar = c(3,3,1,1) + 0.1)

boxplot(track_popularity, col = "steelblue", pch = 19)
mtext("track_popularity", cex = 0.8, side = 1, line = 2)

boxplot(danceability, col = "steelblue", pch = 19)
mtext("danceability", cex = 0.8, side = 1, line = 2)

boxplot(energy, col = "steelblue", pch = 19)
mtext("energy", cex = 0.8, side = 1, line = 2 )

boxplot(key, col = "steelblue", pch = 19)
mtext("key", cex = 0.8, side = 1, line = 2)


boxplot(loudness, col = "steelblue", pch = 19)
mtext("loudness", cex = 0.8, side = 1, line = 2)

boxplot(speechiness, col = "steelblue", pch = 19)
mtext("speechiness", cex = 0.8, side = 1, line = 2)

boxplot(acousticness, col = "steelblue", pch = 19)
mtext("acousticness", cex = 0.8, side = 1, line = 2)

boxplot(instrumentalness, col = "steelblue", pch = 19)
mtext("instrumentalness", cex = 0.8, side = 1, line = 2)

boxplot(liveness, col = "steelblue", pch = 19)
mtext("liveness", cex = 0.8, side = 1, line = 2)

boxplot(valence, col = "steelblue", pch = 19)
mtext("valence", cex = 0.8, side = 1, line = 2)

boxplot(tempo, col = "steelblue", pch = 19)
mtext("tempo", cex = 0.8, side = 1, line = 2)

boxplot(duration_ms, col = "steelblue", pch = 19)
mtext("duration_ms", cex = 0.8, side = 1, line = 2)
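
To put numbers behind the boxplots, the points that boxplot() draws beyond the whiskers can be counted with boxplot.stats(), which applies the same 1.5 x IQR rule by default. A minimal sketch on made-up values (the `toy` data frame and the `count_outliers` helper are assumptions for illustration):

```r
# Toy numeric columns; 0.95 is deliberately far from the rest of liveness
toy <- data.frame(
  liveness = c(0.10, 0.12, 0.11, 0.13, 0.12, 0.95),
  valence  = c(0.40, 0.50, 0.60, 0.50, 0.45, 0.55)
)

# Hypothetical helper: number of points beyond the boxplot whiskers
count_outliers <- function(x) {
  length(boxplot.stats(x)$out)
}

# Outlier count for every column
outlier_counts <- sapply(toy, count_outliers)
outlier_counts
```

On the real data, the same helper could be applied to the numeric columns of spotify_clean to see how many points each boxplot flags.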

Clean Data Set

After data cleaning, there are now a total of 32828 observations and 23 variables in the data set. The following is the structure of the clean data set.

str(spotify_clean)
## 'data.frame':    32828 obs. of  23 variables:
##  $ track_id                : chr  "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
##  $ track_name              : chr  "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr  "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : int  66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : chr  "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
##  $ track_album_name        : chr  "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr  "6/14/2019" "12/13/2019" "7/5/2019" "7/19/2019" ...
##  $ playlist_name           : chr  "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_id             : chr  "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
##  $ playlist_genre          : chr  "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr  "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : int  6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : int  1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num  122 100 124 122 124 ...
##  $ duration_ms             : int  194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
##  - attr(*, "na.action")= 'omit' Named int [1:5] 8152 9283 9284 19569 19812
##   ..- attr(*, "names")= chr [1:5] "8152" "9283" "9284" "19569" ...
datatable(
  head(spotify_clean,100),
  extensions = 'FixedColumns',
  options = list(
    scrollY = "400px",
    scrollX = TRUE,
    fixedColumns = TRUE
  )
)

Variables of Concern

For our analysis, the following are the variables of concern: track_name, track_artist, track_popularity, track_album_name, track_album_release_date, playlist_name, playlist_genre, danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo, and duration_ms.

The variables track_name, track_artist, track_album_name, track_album_release_date, playlist_name and playlist_genre are all non-numeric values. Below is a summary for each numeric variable of concern:

  • track_popularity - mean: 42.48 median: 45.00 min: 0.00 max: 100.00

  • danceability - mean: 0.65 median: 0.67 min: 0.00 max: 0.98

  • energy - mean: 0.70 median: 0.72 min: 0.000175 max: 1.00

  • loudness - mean: -6.720 median: -6.166 min: -46.448 max: 1.275

  • speechiness - mean: 0.1071 median: 0.0625 min: 0.00 max: 0.9180

  • acousticness - mean: 0.1753 median: 0.0804 min: 0.00 max: 0.9940

  • instrumentalness - mean: 0.0847 median: 0.0000161 min: 0.00 max: 0.9940

  • liveness - mean: 0.1902 median: 0.1270 min: 0.00 max: 0.9960

  • valence - mean: 0.5106 median: 0.5120 min: 0.00 max: 0.9910

  • tempo - mean: 120.88 median: 121.98 min: 0.00 max: 239.44

  • duration_ms - mean: 225800 median: 216000 min: 4000 max: 517810

Proposed Exploratory Data Analysis

Data Analysis

For our analysis, we plan to use visualizations (refer to the Plots and Tables section for further detail) to discover the most popular artists, tracks and genres on Spotify overall. Next, we will look at the correlation between all the track attributes to see if any two attributes have a strong correlation. From here, we would like to analyze the relationship between genres and track attributes to uncover whether any specific track attributes are consistently present in popular genres. Then, we will look at the top artists for each genre. With this analysis, we hope to discover what factors affect overall popularity on Spotify, providing insights to artists when creating new music.

We might consider creating a new variable for the album release year by separating the year from the track_album_release_date and then, further splitting the data by the year and grouping it by decades. This will allow us to see how the popular genres/artists/tracks changed over the decades.
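
A minimal sketch of this idea follows. It assumes the dates contain a four-digit year, as in the "6/14/2019" style shown in the str() output; the `release_date` vector is a made-up sample (the 1965 entry is hypothetical), and pulling the year out with a regular expression is just one possible approach. A value with no four-digit year would still become NA.

```r
# Sample dates in the month/day/year style seen in the str() output
release_date <- c("6/14/2019", "12/13/2019", "7/5/2019", "1/1/1965")

# Extract the four-digit year and convert it to an integer
release_year <- as.integer(sub("^.*([0-9]{4}).*$", "\\1", release_date))

# Group years into decades, e.g. 2019 becomes 2010
release_decade <- (release_year %/% 10) * 10

data.frame(release_date, release_year, release_decade)
```

Matching on the year pattern rather than parsing the full date means a value that holds only a year (for example "2012") would still yield a year instead of NA.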

We can summarize our data by using different plots and tables to show relationships between variables (refer to Plots and Tables section for further detail).

Plots and Tables

We plan to use variations of barplot() and pie() to answer the following questions:

  • Who is the most popular artist on Spotify?
  • What is the most popular track on Spotify?
  • What is the most popular genre on Spotify?
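
As a rough sketch of how such a bar plot could be built, the example below ranks artists by their mean track_popularity. Treating mean track_popularity as the measure of "most popular" is our assumption, and the `toy` data frame is made up for illustration; base R is used here, although the same summary could be written with dplyr.

```r
# Toy data: artist names and popularity scores are made up
toy <- data.frame(
  track_artist     = c("A", "A", "B", "C", "C"),
  track_popularity = c(80, 60, 90, 40, 50),
  stringsAsFactors = FALSE
)

# Mean popularity per artist
artist_popularity <- aggregate(
  track_popularity ~ track_artist, data = toy, FUN = mean
)

# Sort from most to least popular
artist_popularity <- artist_popularity[
  order(-artist_popularity$track_popularity),
]

# Bar plot of the ranked artists
barplot(
  artist_popularity$track_popularity,
  names.arg = artist_popularity$track_artist,
  col = "steelblue",
  ylab = "Mean track_popularity"
)
```

The same pattern applies to tracks or genres by swapping track_artist for track_name or playlist_genre.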

We also plan to use various aspects of ggplot() and corrplot() to compare the correlation between different variables and answer the following questions:

  • Who are the top artists for each genre?
  • Is there a correlation between the different track attributes (ex: duration, danceability, energy, etc.)?
  • Is there a correlation between genre and track attributes?
  • What factors affect popularity on Spotify?
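
A minimal sketch of the correlation step is shown below. It builds a correlation matrix with cor() on simulated attributes and passes it to corrplot() only if that package is installed. The `toy` data and the relationship between energy and loudness in it are simulated assumptions, not findings from the Spotify data set.

```r
set.seed(1)

# Simulated track attributes; loudness is constructed to rise with energy
n <- 50
energy <- runif(n)
toy <- data.frame(
  energy   = energy,
  loudness = -10 + 8 * energy + rnorm(n, sd = 0.5),
  valence  = runif(n)
)

# Pairwise Pearson correlations between the numeric attributes
cor_matrix <- cor(toy)
round(cor_matrix, 2)

# Visualize the matrix if the corrplot package is available
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor_matrix, method = "circle")
}
```

On the real data, the input would be the numeric attribute columns of spotify_clean.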

Questions

  1. Is corrplot() the best plot to use to show correlation between all the song attributes?
  2. What is the best plot or table to use to show top artists per genre?
  3. What is the best way to separate year from the track_album_release_date column, without creating any NAs?

Machine Learning Techniques

We need to do further data exploration in order to determine if any machine learning techniques would be beneficial for our analysis. As of now, we are interested in looking into the cluster analysis technique, as this technique allows us to group certain variables in a way that variables in the same group are more similar to each other than to those in other groups.

We could utilize this technique to see if any of the song attributes are similar to other certain attributes and can be grouped together. This can help us see patterns in which song attributes correlate to artist/track popularity.
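
One possible form of cluster analysis is k-means, which groups tracks (rows) by the similarity of their attributes. The sketch below is only an illustration under stated assumptions: the data are simulated with two clearly separated groups, the choice of two clusters is arbitrary, and k-means itself is just one candidate technique that we have not yet settled on.

```r
set.seed(42)

# Simulated tracks: 20 high-danceability/high-energy, 20 low on both
toy <- data.frame(
  danceability = c(rnorm(20, 0.8, 0.03), rnorm(20, 0.4, 0.03)),
  energy       = c(rnorm(20, 0.9, 0.03), rnorm(20, 0.3, 0.03))
)

# Standardize the attributes so each contributes equally to distances
scaled <- scale(toy)

# k-means with two clusters; nstart repeats the random initialization
fit <- kmeans(scaled, centers = 2, nstart = 10)

# Cluster sizes and the cluster centers on the standardized scale
table(fit$cluster)
fit$centers
```

With real data, the number of clusters would need to be chosen more carefully, for example by comparing the within-cluster sum of squares across several values of centers.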