Streaming services have made listening to music effortless. They provide instant, limitless access to music from all over the world and across genres. They have also changed the way that artists share their music. Streaming platforms ease the process of recording, releasing and marketing music for artists. In turn, artists are able to increase engagement with their fans and release more music than before.
Spotify has quickly become the leading music streaming app with 250 million monthly active users, 50 million tracks and more than 3 million artists. One of the reasons Spotify stands out is that it uses algorithms and machine learning to study users' listening habits and music preferences, which increases user engagement and revenue. This makes it a good platform for artists to showcase their music and earn revenue.
We wanted to analyze the Spotify data set from an artist point of view with the following questions:
1. Who is the most popular artist on Spotify?
2. What is the most popular track on Spotify?
3. What is the most popular genre on Spotify?
4. Is there a correlation between the different track attributes (ex: duration, danceability, energy, etc.)?
5. Is there a correlation between genre and track attributes?
6. Who are the top artists for each genre?
7. What factors affect popularity on Spotify?
Our methodology is to first clean the data set and check it for completeness, and then to explore the relationships between different variables using various functions in R.
With our analysis, we hope to help artists create music that will increase their popularity and user engagement on Spotify.
We plan to use the following packages for our analysis of the data set:
library(dplyr) # for data manipulation
library(MASS) # for plotting histograms with truehist()
library(DT) # for showing results in datatable
Data Source
The data used for this analysis comes from Spotify’s API and was collected using the spotifyr package. The complete data set can be found here: Spotify Data Set
The data set includes data from 1960 to 2020. It has 32833 observations and 23 variables. The following table shows each variable’s name, data type and definition.
Data Import
The following code was used to import the data set and to create a sample view of the data before any data cleaning was performed.
# Import the data set
spotify_raw <- read.csv("D:/BANA/Second_Session/Data Wrangling/Spotify_Project/spotify_songs.csv",stringsAsFactors = FALSE)
attach(spotify_raw) # makes the columns available by name (e.g. track_id) in the code that follows
datatable(
  head(spotify_raw, 100),
  extensions = 'FixedColumns',
  options = list(
    scrollY = "400px",
    scrollX = TRUE,
    fixedColumns = TRUE
  )
)
Data Structure and Summary
The following shows the structure of the data set. This allows us to see the variable name and data type of each variable. All the variable names and data types look appropriate for the values, so no modifications were made to change the name or the data type.
str(spotify_raw)
## 'data.frame': 32833 obs. of 23 variables:
## $ track_id : chr "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
## $ track_name : chr "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : chr "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_popularity : int 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : chr "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
## $ track_album_name : chr "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
## $ track_album_release_date: chr "6/14/2019" "12/13/2019" "7/5/2019" "7/19/2019" ...
## $ playlist_name : chr "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
## $ playlist_id : chr "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
## $ playlist_genre : chr "pop" "pop" "pop" "pop" ...
## $ playlist_subgenre : chr "dance pop" "dance pop" "dance pop" "dance pop" ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
The following is a summary of the data set. The summary helps us spot anomalies, such as negative or otherwise abnormal values, that need to be examined further.
summary(spotify_raw)
## track_id track_name track_artist track_popularity
## Length:32833 Length:32833 Length:32833 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 24.00
## Mode :character Mode :character Mode :character Median : 45.00
## Mean : 42.48
## 3rd Qu.: 62.00
## Max. :100.00
## track_album_id track_album_name track_album_release_date
## Length:32833 Length:32833 Length:32833
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## playlist_name playlist_id playlist_genre playlist_subgenre
## Length:32833 Length:32833 Length:32833 Length:32833
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## danceability energy key loudness
## Min. :0.0000 Min. :0.000175 Min. : 0.000 Min. :-46.448
## 1st Qu.:0.5630 1st Qu.:0.581000 1st Qu.: 2.000 1st Qu.: -8.171
## Median :0.6720 Median :0.721000 Median : 6.000 Median : -6.166
## Mean :0.6548 Mean :0.698619 Mean : 5.374 Mean : -6.720
## 3rd Qu.:0.7610 3rd Qu.:0.840000 3rd Qu.: 9.000 3rd Qu.: -4.645
## Max. :0.9830 Max. :1.000000 Max. :11.000 Max. : 1.275
## mode speechiness acousticness instrumentalness
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000000
## 1st Qu.:0.0000 1st Qu.:0.0410 1st Qu.:0.0151 1st Qu.:0.0000000
## Median :1.0000 Median :0.0625 Median :0.0804 Median :0.0000161
## Mean :0.5657 Mean :0.1071 Mean :0.1753 Mean :0.0847472
## 3rd Qu.:1.0000 3rd Qu.:0.1320 3rd Qu.:0.2550 3rd Qu.:0.0048300
## Max. :1.0000 Max. :0.9180 Max. :0.9940 Max. :0.9940000
## liveness valence tempo duration_ms
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 4000
## 1st Qu.:0.0927 1st Qu.:0.3310 1st Qu.: 99.96 1st Qu.:187819
## Median :0.1270 Median :0.5120 Median :121.98 Median :216000
## Mean :0.1902 Mean :0.5106 Mean :120.88 Mean :225800
## 3rd Qu.:0.2480 3rd Qu.:0.6930 3rd Qu.:133.92 3rd Qu.:253585
## Max. :0.9960 Max. :0.9910 Max. :239.44 Max. :517810
Missing Values
There are 15 missing values in the data set overall:
sum(is.na(spotify_raw))
## [1] 15
Below is the number of missing values for each variable:
colSums(is.na(spotify_raw))
## track_id track_name track_artist
## 0 5 5
## track_popularity track_album_id track_album_name
## 0 0 5
## track_album_release_date playlist_name playlist_id
## 0 0 0
## playlist_genre playlist_subgenre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
# Row positions of the missing values in each affected column
track_name_index <- which(is.na(spotify_raw$track_name))
track_name_index
## [1] 8152 9283 9284 19569 19812
track_artist_index <- which(is.na(spotify_raw$track_artist))
track_artist_index
## [1] 8152 9283 9284 19569 19812
track_album_name_index <- which(is.na(spotify_raw$track_album_name))
track_album_name_index
## [1] 8152 9283 9284 19569 19812
This means that all 15 missing values fall in the same 5 observations, out of the total 32833 observations (about 0.015%). Since the percentage of observations with missing values is very low, we decided to remove them.
spotify_clean <- na.omit(spotify_raw)
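The reasoning above (several missing cells concentrated in a handful of rows) can be checked directly. The following is a minimal sketch on a small made-up data frame; `toy_tracks` and its values are hypothetical and only illustrate the use of `complete.cases()` alongside `na.omit()`.

```r
# Toy data frame (hypothetical values) with NAs concentrated in the same rows
toy_tracks <- data.frame(
  track_id         = c("a1", "b2", "c3", "d4"),
  track_name       = c("Song A", NA, "Song C", NA),
  track_artist     = c("Artist A", NA, "Artist C", NA),
  track_album_name = c("Album A", NA, "Album C", NA),
  stringsAsFactors = FALSE
)

total_missing   <- sum(is.na(toy_tracks))             # number of NA cells
incomplete_rows <- which(!complete.cases(toy_tracks)) # rows with at least one NA
toy_clean       <- na.omit(toy_tracks)                # drop the incomplete rows
```

Comparing `total_missing` with `length(incomplete_rows)` shows whether the missing cells are spread across many rows or clustered in a few.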
Duplicate Values
There are no duplicate observations in the data set.
duplicate_values <- duplicated(spotify_raw)
str(duplicate_values)
## logi [1:32833] FALSE FALSE FALSE FALSE FALSE FALSE ...
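Because `str()` prints only the first few elements of the logical vector, a single count is a more direct check. The sketch below uses a made-up data frame (`toy_rows` is hypothetical) to show how `sum(duplicated())` and `anyDuplicated()` could be used for this.

```r
# Toy data frame (hypothetical values); rows 3 and 4 are identical
toy_rows <- data.frame(
  track_id         = c("a1", "a1", "b2", "b2"),
  playlist_genre   = c("pop", "rock", "pop", "pop"),
  stringsAsFactors = FALSE
)

n_duplicate_rows <- sum(duplicated(toy_rows))     # count of fully duplicated rows
has_duplicates   <- anyDuplicated(toy_rows) > 0   # TRUE if any row repeats
```

A count of zero on the real data would confirm that no observation is repeated in full.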
However, some track_id values appear more than once because the same track is listed under multiple playlist_genre values. These are not true duplicates, so they were not removed.
duplicate_values <- duplicated(track_id)
summary(duplicate_values)
## Mode FALSE TRUE
## logical 28356 4477
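One way to check that repeated track_id values really come from multiple genres is to count the distinct genres per track. This is a minimal sketch on made-up data; `toy_playlists` is hypothetical and not taken from the data set.

```r
# Toy data (hypothetical): track "a1" and "c3" appear under more than one genre
toy_playlists <- data.frame(
  track_id         = c("a1", "a1", "b2", "c3", "c3", "c3"),
  playlist_genre   = c("pop", "rock", "pop", "rap", "rap", "latin"),
  stringsAsFactors = FALSE
)

# Number of distinct genres each track_id is listed under
genres_per_track <- sapply(
  split(toy_playlists$playlist_genre, toy_playlists$track_id),
  function(genres) length(unique(genres))
)

# Tracks listed under more than one genre
multi_genre_tracks <- names(genres_per_track)[genres_per_track > 1]
```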
Abnormal Values
Some of the values in the data set were found to contain unusual characters such as the following: ?, @, $, etc.
Values with these characters were not removed, as we found them either to be characters from another language that could not be translated to English or to be part of the actual value.
# List the distinct artist and album names to inspect unusual characters
unique_artist <- unique(spotify_raw["track_artist"])
unique_artist
unique_track_album_name <- unique(spotify_raw["track_album_name"])
unique_track_album_name
Histograms were used to check for the distribution of the data:
par(mfrow = c(2,4), oma = c(1,1,0,0) + 0.1, mar = c(3,3,1,1) + 0.1)
truehist(track_popularity, h = 10, col = "steelblue")
mtext("track_popularity", side = 1, outer = F, line = 2, cex = 0.8)
truehist(danceability, h = 0.1, col = "steelblue")
mtext( "danceability", side = 1, outer = F, line = 2, cex = 0.8)
truehist(energy, h = 0.1, col = "steelblue")
mtext("energy", side = 1, outer = F, line = 2, cex = 0.8)
truehist(loudness, h = 10, col = "steelblue")
mtext("loudness", side = 1, outer = F, line = 2, cex = 0.8)
truehist(speechiness, h = 0.1, col = "steelblue")
mtext("speechiness", side = 1, outer = F, line = 2, cex = 0.8)
truehist(acousticness, h = 0.1, col = "steelblue")
mtext("acousticness", side = 1, outer = F, line = 2, cex = 0.8)
truehist(tempo, h = 10, col = "steelblue") # bin width of 10 BPM; tempo ranges from 0 to about 240
mtext("tempo", side = 1, outer = F, line = 2, cex = 0.8)
Boxplots were used to check each numeric variable for outliers. There are very few outliers, and they fall close to the range of the rest of the data. Hence, they are not considered extreme values and have not been removed.
par(mfrow = c(2,6), oma = c(1,1,0,0) + 0.1, mar = c(3,3,1,1) + 0.1)
boxplot(track_popularity, col = "steelblue", pch = 19)
mtext("track_popularity", cex = 0.8, side = 1, line = 2)
boxplot(danceability, col = "steelblue", pch = 19)
mtext("danceability", cex = 0.8, side = 1, line = 2)
boxplot(energy, col = "steelblue", pch = 19)
mtext("energy", cex = 0.8, side = 1, line = 2 )
boxplot(key, col = "steelblue", pch = 19)
mtext("key", cex = 0.8, side = 1, line = 2)
boxplot(loudness, col = "steelblue", pch = 19)
mtext("loudness", cex = 0.8, side = 1, line = 2)
boxplot(speechiness, col = "steelblue", pch = 19)
mtext("speechiness", cex = 0.8, side = 1, line = 2)
boxplot(acousticness, col = "steelblue", pch = 19)
mtext("acousticness", cex = 0.8, side = 1, line = 2)
boxplot(instrumentalness, col = "steelblue", pch = 19)
mtext("instrumentalness", cex = 0.8, side = 1, line = 2)
boxplot(liveness, col = "steelblue", pch = 19)
mtext("liveness", cex = 0.8, side = 1, line = 2)
boxplot(valence, col = "steelblue", pch = 19)
mtext("valence", cex = 0.8, side = 1, line = 2)
boxplot(tempo, col = "steelblue", pch = 19)
mtext("tempo", cex = 0.8, side = 1, line = 2)
boxplot(duration_ms, col = "steelblue", pch = 19)
mtext("duration_ms", cex = 0.8, side = 1, line = 2)
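The visual check can be complemented with a count of the points a boxplot would flag. The sketch below uses `boxplot.stats()`, which applies the same 1.5 times IQR whisker rule as `boxplot()`; the vector `toy_loudness` is made up for illustration.

```r
# Toy loudness values (hypothetical); -30.0 lies far from the rest
toy_loudness <- c(-8.2, -6.1, -5.4, -7.0, -4.6, -6.7, -5.9, -30.0)

# Values beyond the boxplot whiskers (1.5 * IQR from the hinges)
outlier_values <- boxplot.stats(toy_loudness)$out

# Share of observations flagged as outliers
outlier_share <- length(outlier_values) / length(toy_loudness)
```

Running this per numeric column would give a number to back the judgment that the outliers are few.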
After data cleaning, there are now a total of 32828 observations and 23 variables in the data set. The following is the structure of the clean data set.
str(spotify_clean)
## 'data.frame': 32828 obs. of 23 variables:
## $ track_id : chr "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
## $ track_name : chr "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : chr "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_popularity : int 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : chr "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
## $ track_album_name : chr "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
## $ track_album_release_date: chr "6/14/2019" "12/13/2019" "7/5/2019" "7/19/2019" ...
## $ playlist_name : chr "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
## $ playlist_id : chr "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
## $ playlist_genre : chr "pop" "pop" "pop" "pop" ...
## $ playlist_subgenre : chr "dance pop" "dance pop" "dance pop" "dance pop" ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
## - attr(*, "na.action")= 'omit' Named int [1:5] 8152 9283 9284 19569 19812
## ..- attr(*, "names")= chr [1:5] "8152" "9283" "9284" "19569" ...
datatable(
  head(spotify_clean, 100),
  extensions = 'FixedColumns',
  options = list(
    scrollY = "400px",
    scrollX = TRUE,
    fixedColumns = TRUE
  )
)
For our analysis, the following are the variables of concern: track_name, track_artist, track_popularity, track_album_name, track_album_release_date, playlist_name, playlist_genre, danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo, and duration_ms.
The variables track_name, track_artist, track_album_name, track_album_release_date, playlist_name and playlist_genre are all non-numeric. Below is a summary for each numeric variable of concern:
track_popularity - mean: 42.48 median: 45.00 min: 0.00 max: 100.00
danceability - mean: 0.65 median: 0.67 min: 0.00 max: 0.98
energy - mean: 0.70 median: 0.72 min: 0.0002 max: 1.00
loudness - mean: -6.720 median: -6.166 min: -46.448 max: 1.275
speechiness - mean: 0.1071 median: 0.0625 min: 0.00 max: 0.9180
acousticness - mean: 0.1753 median: 0.0804 min: 0.00 max: 0.9940
instrumentalness - mean: 0.0847 median: 0.0000161 min: 0.00 max: 0.9940
liveness - mean: 0.1902 median: 0.1270 min: 0.00 max: 0.9960
valence - mean: 0.5106 median: 0.5120 min: 0.00 max: 0.9910
tempo - mean: 120.88 median: 121.98 min: 0.00 max: 239.44
duration_ms - mean: 225800 median: 216000 min: 4000 max: 517810
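A summary like the one above can also be generated in code rather than transcribed by hand, which avoids copy errors. This is a minimal sketch on a made-up data frame; `toy_features` and its values are hypothetical.

```r
# Toy numeric features (hypothetical values)
toy_features <- data.frame(
  danceability = c(0.75, 0.73, 0.68, 0.72),
  energy       = c(0.92, 0.82, 0.93, 0.93)
)

# One row per variable, one column per statistic
feature_summary <- t(sapply(toy_features, function(x) {
  c(mean = mean(x), median = median(x), min = min(x), max = max(x))
}))
feature_summary <- round(feature_summary, 2)
feature_summary
```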
Data Analysis
For our analysis, we plan to use visualizations (refer to Plots and Tables section for further detail) to discover the most popular artists, tracks and genres on Spotify overall. Next, we will look at the correlation between all the track attributes to see if any two attributes have a strong correlation. From here, we would like to analyze the relationship between different genres and track attributes to uncover whether any specific track attributes are consistently present in popular genres. Then, we will look at the top artists for each genre. With this analysis, we hope to discover what factors affect overall popularity on Spotify, providing insights to artists when creating new music.
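As a rough illustration of the correlation step, the sketch below computes a Pearson correlation matrix with base R's `cor()`. The data frame `toy_attributes` is made up and its values do not come from the data set; a package such as corrplot could then be used to visualize the resulting matrix.

```r
# Toy track attributes (hypothetical values, not from the data set)
toy_attributes <- data.frame(
  energy       = c(0.2, 0.4, 0.6, 0.8, 0.9),
  loudness     = c(-12, -9, -7, -5, -4),
  acousticness = c(0.9, 0.7, 0.4, 0.2, 0.1)
)

# Pairwise Pearson correlations between all numeric attributes
cor_matrix <- cor(toy_attributes, method = "pearson")
round(cor_matrix, 2)
```

Values close to 1 or -1 indicate a strong linear relationship between two attributes, while values near 0 indicate little linear relationship.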
We might consider creating a new variable for the album release year by separating the year from the track_album_release_date and then, further splitting the data by the year and grouping it by decades. This will allow us to see how the popular genres/artists/tracks changed over the decades.
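The release-year and decade variables described above could be derived along these lines. This is a sketch that assumes every date is in month/day/year form, as in the sample rows shown earlier; dates in any other form would parse to NA and would need separate handling. The vector `toy_release_dates` is made up.

```r
# Toy release dates (hypothetical), assumed to be in month/day/year format
toy_release_dates <- c("6/14/2019", "12/13/2019", "7/5/1985", "3/1/1969")

# Parse the strings, then extract the year as an integer
parsed_dates <- as.Date(toy_release_dates, format = "%m/%d/%Y")
release_year <- as.integer(format(parsed_dates, "%Y"))

# Integer division by 10 drops the last digit, giving the decade
release_decade <- (release_year %/% 10) * 10

# Number of tracks per decade
table(release_decade)
```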
We can summarize our data by using different plots and tables to show relationships between variables (refer to Plots and Tables section for further detail).
Plots and Tables
We plan to use various versions of the barplot() and pie() to answer the following questions:
We also plan to use various aspects of ggplot() and corrplot() to compare the correlation between different variables and answer the following questions:
Questions
Machine Learning Techniques
We need to do further data exploration in order to determine if any machine learning techniques would be beneficial for our analysis. As of now, we are interested in looking into cluster analysis, as this technique allows us to group items (for example, songs or song attributes) so that items in the same group are more similar to each other than to those in other groups.
We could utilize this technique to see if any of the song attributes are similar to certain other attributes and can be grouped together. This can help us see patterns in which song attributes correlate with artist/track popularity.
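As one possible starting point, the sketch below runs k-means clustering with base R's `kmeans()`. Note that k-means groups observations (songs) by their attribute values rather than grouping the attributes themselves. The data frame `toy_songs`, the choice of two clusters and the seed are all assumptions made for illustration.

```r
set.seed(42)  # k-means uses random starting centers

# Toy songs (hypothetical values): three high-energy and three low-energy tracks
toy_songs <- data.frame(
  danceability = c(0.80, 0.78, 0.82, 0.30, 0.28, 0.35),
  energy       = c(0.90, 0.85, 0.88, 0.25, 0.30, 0.20)
)

# Standardize so that each attribute contributes equally to the distances
scaled_songs <- scale(toy_songs)

# Fit k-means with 2 clusters; nstart tries several random starts
fit <- kmeans(scaled_songs, centers = 2, nstart = 10)
fit$cluster
```

On the real data, the number of clusters would need to be chosen with care, for example by comparing the within-cluster sum of squares across several values of `centers`.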