Spotify Data Analysis for Music Artists

Introduction

Streaming services have made listening to music effortless. They provide instant, limitless access to music from all over the world and various genres. They have also changed the way that music artists share their music. Streaming platforms ease the process of recording, releasing and marketing music for the artists. In turn, artists are able to increase engagement with their fans and release more music than before.

Spotify has quickly become the leading music streaming app with 250 million monthly active users, 50 million tracks and more than 3 million artists. One of the reasons Spotify stands out is that it uses algorithms and machine learning to study the listening habits and music preferences of users, increasing user engagement and revenue. This makes it a good platform for artists to showcase their music and earn revenue.

We wanted to analyze the Spotify data set from an artist's point of view with the following questions:

1. Who is the most popular artist on Spotify? 
2. What is the most popular track on Spotify?
3. What is the most popular genre on Spotify?
4. Is there a correlation between the different track attributes (e.g., duration, danceability, energy)?
5. Is there a correlation between genre and track attributes?
6. Who are the top artists for each genre?
7. What factors affect popularity on Spotify?

Our methodology is to first clean the data set and check it for completeness, and then to explore the relationships between different variables using various functions in R.

With our analysis, we hope to help artists create music that will increase their popularity and user engagement on Spotify.

Required Packages

We plan to use the following packages for our analysis of the data set:

dplyr - used to manipulate data
tidyverse - used to tidy and reshape data (e.g., pivot_longer())
DT - used to summarize results in data tables
MASS - used to plot true histograms (truehist()) for track attributes
ggplot2 - used to create visualizations for our analysis
treemap - used to create treemap visualization
formattable - used to format table outputs
GGally - used to analyze correlation between track attributes

library(dplyr) 
library(tidyverse)
library(DT) 
library(MASS) 
library(ggplot2)
library(treemap)
library(formattable)
library(GGally)

Data Preparation

Data Source and Data Import

Data Source

The data used for this analysis comes from Spotify’s API and was collected using the spotifyr package. The complete data set can be found here: Spotify Data Set

The data set includes data from 1960 to 2020. It has 32833 observations and 23 variables. The following table shows each variable’s name, data type and definition.

Data Import

The following code was used to import the data set and to create a sample view of the data before any data cleaning was performed.

# Import the data set 
spotify_raw <- read.csv("D:/BANA/Second_Session/Data Wrangling/Spotify_Project/spotify_songs.csv",stringsAsFactors = FALSE)
attach(spotify_raw)

datatable(
  head(spotify_raw,100),
  extensions = 'FixedColumns',
  options = list(
    scrollY = "400px",
    scrollX = TRUE,
    fixedColumns = TRUE
  )
)

Data Cleaning

Data Structure and Summary

The following shows the structure of the data set. This allows us to see the variable name and data type of each variable. All the variable names and data types look appropriate for the values, so no modifications were made to change the name or the data type.

str(spotify_raw)

## 'data.frame':    32833 obs. of  23 variables:
##  $ track_id                : chr  "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
##  $ track_name              : chr  "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr  "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : int  66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : chr  "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
##  $ track_album_name        : chr  "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr  "6/14/2019" "12/13/2019" "7/5/2019" "7/19/2019" ...
##  $ playlist_name           : chr  "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_id             : chr  "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
##  $ playlist_genre          : chr  "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr  "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : int  6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : int  1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num  122 100 124 122 124 ...
##  $ duration_ms             : int  194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
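The str() output shows track_album_release_date stored as character. The analysis keeps it that way, but if release dates were ever needed as dates (for example, to confirm the 1960 to 2020 span), a minimal sketch could look like the following. It assumes the month/day/year format seen in the first few values; the toy vector and the names release_chr, release_date, release_year and n_failed are illustrative only.

```r
# Toy values copied from the str() output above; the full column may contain
# other formats (an assumption worth checking before converting everything).
release_chr <- c("6/14/2019", "12/13/2019", "7/5/2019", "7/19/2019")

# as.Date() returns NA for strings that do not match the format,
# so failed conversions are easy to count afterwards.
release_date <- as.Date(release_chr, format = "%m/%d/%Y")
release_year <- as.integer(format(release_date, "%Y"))

# Number of values that could not be parsed with this format.
n_failed <- sum(is.na(release_date))
```

A non-zero n_failed on the real column would indicate that some dates use a different format and need separate handling.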

The following is a summary of the data set. The summary helps us identify anomalies, such as negative or otherwise abnormal values, that need to be examined further.

summary(spotify_raw)

##    track_id          track_name        track_artist       track_popularity
##  Length:32833       Length:32833       Length:32833       Min.   :  0.00  
##  Class :character   Class :character   Class :character   1st Qu.: 24.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 45.00  
##                                                           Mean   : 42.48  
##                                                           3rd Qu.: 62.00  
##                                                           Max.   :100.00  
##  track_album_id     track_album_name   track_album_release_date
##  Length:32833       Length:32833       Length:32833            
##  Class :character   Class :character   Class :character        
##  Mode  :character   Mode  :character   Mode  :character        
##                                                                
##                                                                
##                                                                
##  playlist_name      playlist_id        playlist_genre     playlist_subgenre 
##  Length:32833       Length:32833       Length:32833       Length:32833      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   danceability        energy              key            loudness      
##  Min.   :0.0000   Min.   :0.000175   Min.   : 0.000   Min.   :-46.448  
##  1st Qu.:0.5630   1st Qu.:0.581000   1st Qu.: 2.000   1st Qu.: -8.171  
##  Median :0.6720   Median :0.721000   Median : 6.000   Median : -6.166  
##  Mean   :0.6548   Mean   :0.698619   Mean   : 5.374   Mean   : -6.720  
##  3rd Qu.:0.7610   3rd Qu.:0.840000   3rd Qu.: 9.000   3rd Qu.: -4.645  
##  Max.   :0.9830   Max.   :1.000000   Max.   :11.000   Max.   :  1.275  
##       mode         speechiness      acousticness    instrumentalness   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000000  
##  1st Qu.:0.0000   1st Qu.:0.0410   1st Qu.:0.0151   1st Qu.:0.0000000  
##  Median :1.0000   Median :0.0625   Median :0.0804   Median :0.0000161  
##  Mean   :0.5657   Mean   :0.1071   Mean   :0.1753   Mean   :0.0847472  
##  3rd Qu.:1.0000   3rd Qu.:0.1320   3rd Qu.:0.2550   3rd Qu.:0.0048300  
##  Max.   :1.0000   Max.   :0.9180   Max.   :0.9940   Max.   :0.9940000  
##     liveness         valence           tempo         duration_ms    
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Min.   :  4000  
##  1st Qu.:0.0927   1st Qu.:0.3310   1st Qu.: 99.96   1st Qu.:187819  
##  Median :0.1270   Median :0.5120   Median :121.98   Median :216000  
##  Mean   :0.1902   Mean   :0.5106   Mean   :120.88   Mean   :225800  
##  3rd Qu.:0.2480   3rd Qu.:0.6930   3rd Qu.:133.92   3rd Qu.:253585  
##  Max.   :0.9960   Max.   :0.9910   Max.   :239.44   Max.   :517810

Missing Values

There are 15 missing values in the data set overall:

sum(is.na(spotify_raw))

## [1] 15

Below is the number of missing values for each affected variable:

track_name - 5
track_artist - 5
track_album_name - 5

colSums(is.na(spotify_raw))

##                 track_id               track_name             track_artist 
##                        0                        5                        5 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        5 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0

track_name_index = which(is.na(track_name))
track_name_index

## [1]  8152  9283  9284 19569 19812

track_artist_index = which(is.na(track_artist))
track_artist_index

## [1]  8152  9283  9284 19569 19812

track_album_name_index = which(is.na(track_album_name))
track_album_name_index

## [1]  8152  9283  9284 19569 19812

This means that only 5 of the 32833 observations (0.015%) contain missing values, and the same 5 rows are missing all three fields. Since the percentage is very low, we decided to remove the observations with missing values.

spotify_clean <- na.omit(spotify_raw)
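The row-level reasoning above can be sketched on a small example. The data frame toy below is made up and only stands in for spotify_raw; the last line repeats the percentage arithmetic for the figures reported in this section.

```r
# Toy data frame standing in for spotify_raw (values are made up).
toy <- data.frame(
  track_id     = c("a1", "b2", "c3", "d4"),
  track_name   = c("Song A", NA, "Song C", "Song D"),
  track_artist = c("Artist A", NA, "Artist C", "Artist D"),
  stringsAsFactors = FALSE
)

# complete.cases() flags rows with no NA in any column, so this counts
# affected rows rather than individual missing cells.
incomplete_rows <- sum(!complete.cases(toy))
pct_incomplete  <- 100 * incomplete_rows / nrow(toy)

# na.omit() drops exactly those rows.
toy_clean <- na.omit(toy)

# The same arithmetic for the figures reported above: 5 of 32833 rows.
pct_reported <- round(100 * 5 / 32833, 3)
```

Counting rows with complete.cases() rather than cells with sum(is.na()) is what separates the 5 affected observations from the 15 missing values.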

Duplicate Values

There are no duplicate observations in the data set.

duplicate_values <- duplicated(spotify_raw)
str(duplicate_values)

##  logi [1:32833] FALSE FALSE FALSE FALSE FALSE FALSE ...

However, there are duplicate track_id values in the data set. These tracks appear under multiple playlist_genre values, which means that they are not true duplicates. Hence, these values were not removed.

duplicate_values <- duplicated(track_id)
summary(duplicate_values)

##    Mode   FALSE    TRUE 
## logical   28356    4477
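One way to check whether repeated track_id values really span several genres is to count the distinct genres per track. The sketch below uses a made-up data frame in place of spotify_raw; the names toy, genres_per_track, multi_genre_ids and n_repeat_rows are illustrative.

```r
# Toy rows (made up): track "t1" sits in a pop and an edm playlist,
# while "t3" is repeated within the same genre.
toy <- data.frame(
  track_id       = c("t1", "t1", "t2", "t3", "t3"),
  playlist_genre = c("pop", "edm", "rock", "rap", "rap"),
  stringsAsFactors = FALSE
)

# Number of distinct genres each track_id appears under.
genres_per_track <- tapply(toy$playlist_genre, toy$track_id,
                           function(g) length(unique(g)))

# Tracks listed under more than one genre.
multi_genre_ids <- names(genres_per_track)[as.vector(genres_per_track > 1)]

# Rows that repeat an earlier track_id, as counted by duplicated().
n_repeat_rows <- sum(duplicated(toy$track_id))
```

In the toy data, only "t1" spans two genres even though two rows repeat an earlier track_id, so the two counts do not have to agree.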

Abnormal Values

Some of the values in the data set were found to have unknown characters such as the following: ?, @, $, etc.

Values with these characters were not removed because we found them to be either characters from another language that couldn’t be translated to English or part of the actual value.

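# Distinct artist names, for visual inspection of unusual characters
unique_artist <- unique(spotify_raw["track_artist"])
unique_artist

# Distinct album names, for the same inspection
unique_track_album_name <- unique(spotify_raw["track_album_name"])
unique_track_album_name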
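Scanning thousands of names by eye is slow, so a programmatic flag can narrow the search. The following is a minimal sketch on made-up artist names, not the check used in this analysis; has_non_ascii and has_symbol are illustrative names.

```r
# Made-up artist names; "\u00e9" is an accented e written as an escape
# so the example does not depend on the file encoding.
artists <- c("Queen", "Beyonc\u00e9", "A$AP Rocky", "P!nk")

# TRUE where a name contains a character outside the ASCII range.
has_non_ascii <- vapply(
  artists,
  function(s) any(utf8ToInt(s) > 127L),
  logical(1),
  USE.NAMES = FALSE
)

# TRUE where a name contains one of the symbols mentioned above (?, @, $).
has_symbol <- grepl("[?@$]", artists)

# Names worth a manual look.
to_review <- artists[has_non_ascii | has_symbol]
```

Flagged names still need a manual decision, since a symbol such as $ can be part of the real artist name.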

Histograms were used to check the distribution of the data:

The following variables appear to have right-skewed data: speechiness and acousticness.
The following variables appear to have slightly left-skewed data: track_popularity, energy, tempo and loudness.
The danceability variable has an approximately normal distribution.

par(mfrow = c(2,4), oma = c(1,1,0,0) + 0.1, mar = c(3,3,1,1) + 0.1)

truehist(track_popularity, h = 10, col = "steelblue")
mtext("track_popularity", side = 1, outer = F, line = 2, cex = 0.8)

truehist(danceability, h = 0.1, col = "steelblue")
mtext( "danceability", side = 1, outer = F, line = 2, cex = 0.8)

truehist(energy, h = 0.1, col = "steelblue")
mtext("energy", side = 1, outer = F, line = 2, cex = 0.8)

truehist(loudness, h = 10, col = "steelblue")
mtext("loudness", side = 1, outer = F, line = 2, cex = 0.8)

truehist(speechiness, h = 0.1, col = "steelblue")
mtext("speechiness", side = 1, outer = F, line = 2, cex = 0.8)

truehist(acousticness, h = 0.1, col = "steelblue")
mtext("acousticness", side = 1, outer = F, line = 2, cex = 0.8)

# Bin width of 10 BPM; a width of 0.1 would split tempo into thousands of bins
truehist(tempo, h = 10, col = "steelblue")
mtext("tempo", side = 1, outer = F, line = 2, cex = 0.8)
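The skew read off the histograms can be cross-checked numerically. The sketch below defines a small moment-based skewness helper and applies it to made-up vectors; sample_skewness is a hypothetical helper written for this illustration, not a function from the packages loaded above.

```r
# Moment-based sample skewness: positive for a long right tail,
# negative for a long left tail, zero for perfectly symmetric data.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Made-up vectors for illustration only.
right_tailed <- c(0.03, 0.04, 0.05, 0.06, 0.08, 0.30, 0.90)
left_tailed  <- 1 - right_tailed
symmetric    <- c(-2, -1, 0, 1, 2)

skew_values <- c(
  right = sample_skewness(right_tailed),
  left  = sample_skewness(left_tailed),
  sym   = sample_skewness(symmetric)
)
```

Applied to columns such as speechiness or track_popularity, the sign of the result would indicate the direction of the skew described above.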

Boxplots were used to check for outliers in each numeric variable. There are a few outliers, but they fall close to the range of the rest of the data. Hence, they are not considered extreme values and have not been removed.

par(mfrow = c(2,6), oma = c(1,1,0,0) + 0.1, mar = c(3,3,1,1) + 0.1)

boxplot(track_popularity, col = "steelblue", pch = 19)
mtext("track_popularity", cex = 0.8, side = 1, line = 2)

boxplot(danceability, col = "steelblue", pch = 19)
mtext("danceability", cex = 0.8, side = 1, line = 2)

boxplot(energy, col = "steelblue", pch = 19)
mtext("energy", cex = 0.8, side = 1, line = 2 )

boxplot(key, col = "steelblue", pch = 19)
mtext("key", cex = 0.8, side = 1, line = 2)

boxplot(loudness, col = "steelblue", pch = 19)
mtext("loudness", cex = 0.8, side = 1, line = 2)

boxplot(speechiness, col = "steelblue", pch = 19)
mtext("speechiness", cex = 0.8, side = 1, line = 2)

boxplot(acousticness, col = "steelblue", pch = 19)
mtext("acousticness", cex = 0.8, side = 1, line = 2)

boxplot(instrumentalness, col = "steelblue", pch = 19)
mtext("instrumentalness", cex = 0.8, side = 1, line = 2)

boxplot(liveness, col = "steelblue", pch = 19)
mtext("liveness", cex = 0.8, side = 1, line = 2)

boxplot(valence, col = "steelblue", pch = 19)
mtext("valence", cex = 0.8, side = 1, line = 2)

boxplot(tempo, col = "steelblue", pch = 19)
mtext("tempo", cex = 0.8, side = 1, line = 2)

boxplot(duration_ms, col = "steelblue", pch = 19)
mtext("duration_ms", cex = 0.8, side = 1, line = 2)
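The points drawn beyond the whiskers can also be counted rather than judged by eye. This is a minimal sketch on made-up values that resemble loudness; boxplot.stats() applies the same 1.5 x IQR rule that boxplot() uses by default.

```r
# Made-up loudness-like values with one clearly extreme point.
x <- c(-9.1, -8.2, -7.5, -6.9, -6.2, -5.8, -5.1, -4.6, -4.0, -46.4)

# boxplot.stats() returns the points that fall more than 1.5 * IQR
# beyond the hinges, i.e. the points boxplot() draws individually.
out_values <- boxplot.stats(x)$out

# Share of observations flagged as outliers.
out_share <- length(out_values) / length(x)
```

Running the same two lines per numeric column would give the outlier share for each variable.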

Clean Data Set

After data cleaning, there are now a total of 32828 observations and 23 variables in the data set. The following is the structure of the clean data set.

str(spotify_clean)

## 'data.frame':    32828 obs. of  23 variables:
##  $ track_id                : chr  "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
##  $ track_name              : chr  "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr  "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : int  66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : chr  "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
##  $ track_album_name        : chr  "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr  "6/14/2019" "12/13/2019" "7/5/2019" "7/19/2019" ...
##  $ playlist_name           : chr  "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_id             : chr  "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
##  $ playlist_genre          : chr  "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr  "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : int  6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : int  1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num  122 100 124 122 124 ...
##  $ duration_ms             : int  194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
##  - attr(*, "na.action")= 'omit' Named int [1:5] 8152 9283 9284 19569 19812
##   ..- attr(*, "names")= chr [1:5] "8152" "9283" "9284" "19569" ...

datatable(
  head(spotify_clean,100),
  extensions = 'FixedColumns',
  options = list(
    scrollY = "400px",
    scrollX = TRUE,
    fixedColumns = TRUE
  )
)

Variables of Concern

For our analysis, the following are the variables of concern: track_name, track_artist, track_popularity, track_album_name, track_album_release_date, playlist_name, playlist_genre, danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo, and duration_ms.

The variables track_name, track_artist, track_album_name, track_album_release_date, playlist_name and playlist_genre are all non-numeric values. Below is a summary for each numeric variable of concern:

track_popularity - mean: 42.48 median: 45.00 min: 0.00 max: 100.00
danceability - mean: 0.65 median: 0.67 min: 0.00 max: 0.98
energy - mean: 0.70 median: 0.72 min: 0.000175 max: 1.00
loudness - mean: -6.72 median: -6.166 min: -46.448 max: 1.275
speechiness - mean: 0.1071 median: 0.0625 min: 0.00 max: 0.918
acousticness - mean: 0.1753 median: 0.0804 min: 0.00 max: 0.9940
instrumentalness - mean: 0.0847 median: 0.0000161 min: 0.00 max: 0.9940
liveness - mean: 0.1902 median: 0.1270 min: 0.00 max: 0.9960
valence - mean: 0.5106 median: 0.5120 min: 0.00 max: 0.9910
tempo - mean: 120.88 median: 121.98 min: 0.00 max: 239.44
duration_ms - mean: 225800 median: 216000 min: 4000 max: 517810
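A table like the one above can also be generated directly from the data, which avoids transcription slips when copying values from summary(). The sketch below uses a made-up data frame in place of spotify_clean; the name summary_table is illustrative.

```r
# Toy stand-in for spotify_clean (values are made up).
toy <- data.frame(
  track_popularity = c(10, 45, 60, 100),
  danceability     = c(0.40, 0.65, 0.70, 0.90),
  energy           = c(0.20, 0.70, 0.80, 1.00)
)

# sapply() returns one column per variable; t() flips the result so that
# each variable becomes a row and each statistic a column.
summary_table <- t(sapply(toy, function(x) {
  c(mean = mean(x), median = median(x), min = min(x), max = max(x))
}))
summary_table <- round(summary_table, 2)
```

With the real data, the data frame would first be restricted to the numeric variables of concern.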

Exploratory Data Analysis

Proposed Data Analysis

For our analysis, we plan to use visualizations (refer to the Plots and Tables section for further detail) to discover the most popular artist, track and genre on Spotify overall. Next, we will look at the correlation between all the track attributes to see whether any two attributes are strongly correlated. From there, we would like to analyze the relationship between genres and track attributes to uncover whether specific track attributes are consistently present in popular genres. Then, we will look at the top artists for each genre. With this analysis, we hope to discover what factors affect overall popularity on Spotify, providing insights to artists when creating new music.

We can summarize our data by using different plots and tables to show relationships between variables (refer to Plots and Tables section for further detail).

Plots and Tables

We plan to use various versions of the barplot() and pie() to answer the following questions:

Who is the most popular artist on Spotify?
What is the most popular track on Spotify?
What is the most popular genre on Spotify?

We also plan to use various aspects of ggplot() and ggcorr() (from the GGally package) to compare the correlation between different variables and answer the following questions:

Who are the top artists for each genre?
Is there a correlation between the different track attributes (e.g., duration, danceability, energy)?
Is there a correlation between genre and track attributes?
What factors affect popularity on Spotify?

Popularity Analysis

The following analysis was done to see what factors affect popularity on Spotify. First, we did an initial analysis to find the most popular artist, track and genre. Then, we analyzed the relationship between track attributes and popular songs.

Popular Artist

The most popular artist on Spotify was found by looking for the artists with the largest number of distinct tracks in the data set. The following data table shows the top 10 artists by number of tracks.

spotify_clean %>% 
  group_by(track_artist) %>% distinct(track_name) %>%
  tally() %>% 
  arrange(desc(n)) %>% 
  top_n(10,n) %>%
  formattable(align = c("l"), col.names = c('Track Artist', 'Number of Tracks'))

Track Artist	Number of Tracks
Queen	111
Martin Garrix	73
David Guetta	64
Logic	62
Hardwell	61
Don Omar	59
Dimitri Vegas & Like Mike	56
The Chainsmokers	56
Drake	47
Calvin Harris	46

This can also be visualized in the following bar chart.

spotify_clean %>% 
  group_by(track_artist) %>% distinct(track_name) %>%
  tally() %>% 
  arrange(desc(n)) %>% 
  top_n(10,n) %>%
  #Creating Bar Chart
  ggplot() +
  geom_col(aes(x = reorder(track_artist, -n), y = n, fill = track_artist),colour = "black") + 
  scale_fill_brewer() +
  #Formatting Title/Axis Labels
  ggtitle("Number of Tracks by Artist") + 
  xlab("Track Artist") + 
  ylab("Number of Tracks") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = -45, hjust = 0, vjust = 1)) +
  theme(legend.position = "none", plot.background = element_rect(fill = "white"))

The artist with the largest number of tracks on Spotify is Queen with 111 songs, followed by Martin Garrix, David Guetta, Logic, Hardwell, Don Omar, Dimitri Vegas & Like Mike, The Chainsmokers, Drake and Calvin Harris.

Popular Track

The most popular track on Spotify was found by filtering on the track_popularity variable in the data set. The track_popularity variable scores the popularity of a track from 0 to 100, where 100 is the highest.

We considered any track with a track_popularity value of 90 or greater as popular. The following table shows a list of these tracks:

popular_track <- spotify_clean %>%
  filter(track_popularity >= 90) %>%
  arrange(desc(track_popularity)) %>% 
  distinct(track_name, track_popularity)

datatable(
  head(popular_track,10),
  extensions = 'FixedColumns',
  options = list(
    scrollY = "400px",
    scrollX = TRUE,
    fixedColumns = TRUE
  )
)

The most popular track on Spotify is Dance Monkey, as it is the only track with a track_popularity value of 100. Other tracks in the top 10 include Roxanne, Tusa, Memories, Blinding Lights, Circles, The Box, everything i wanted, Don’t Start Now and Falling.

Popular Genre

The most popular genre on Spotify was found by analyzing the number of tracks per genre. The following data table shows the number of tracks per genre.

spotify_clean %>% 
  count(playlist_genre) %>%
  formattable(align = c("l"), col.names = c('Playlist Genre', 'Number of Tracks'))

Playlist Genre	Number of Tracks
edm	6043
latin	5153
pop	5507
r&b	5431
rap	5743
rock	4951

This can also be visualized in the following bar chart.

spotify_clean %>% count(playlist_genre) %>%
ggplot() +
  geom_col(aes(x = reorder(playlist_genre, -n), y = n, fill = playlist_genre),colour = "black") +   
  scale_fill_brewer() +
  #Formatting Title/Axis Labels
  ggtitle("Number of Tracks by Genre") + 
  xlab("Playlist Genre") + 
  ylab("Number of Tracks") +
  theme_classic() +
  theme(plot.background = element_rect(fill = "white"), legend.position = "none")

The most popular genre on Spotify is EDM with 6043 tracks, followed by rap, pop, R&B, Latin and rock.

Track Attributes Frequency in Popular Songs

After analyzing the most popular artist, track and genre on Spotify, we wanted to see if certain track attributes were more frequent in popular songs.

This analysis was done by only looking at tracks with a track_popularity value of 90 or greater. Then, the frequency was analyzed using box plots and jitter plots created for each attribute.

#Pulling the needed variables for analysis
song_attributes2 <- names(spotify_clean)[c(12:13, 15, 17:23)]

popularity_comp <- spotify_clean %>% 
  filter(track_popularity >= 90) %>%
  pivot_longer(cols = song_attributes2) 

#creating plots, formatting
popularity_comp %>%
  ggplot(aes(x = name, y = value)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(cex = .4, color='cornflowerblue') +
  facet_wrap(~name, ncol = 3, scales = 'free') +
  theme_classic() +
  theme(axis.text.x = element_blank()) +
  labs(title = 'Track Attributes Frequency for Popular Songs', x = '', y = '')

The following observations were made from this analysis:

Most of the popular songs on Spotify are low in acousticness, liveness, instrumentalness and speechiness
Most of the popular songs on Spotify are high in danceability
Most of the popular songs on Spotify are louder and have a shorter duration
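Terms such as "high", "louder" and "shorter" are easier to judge against a baseline. As a hedged sketch only (this comparison is not part of the analysis above), the medians of popular tracks could be set next to those of all other tracks. The data frame toy and its values are made up.

```r
# Made-up rows standing in for spotify_clean.
toy <- data.frame(
  track_popularity = c(95, 92, 90, 40, 55, 10, 70),
  danceability     = c(0.80, 0.75, 0.85, 0.50, 0.60, 0.40, 0.65),
  duration_ms      = c(180000, 190000, 175000, 240000, 230000, 300000, 210000)
)

# Same cut-off as the analysis above: 90 or greater counts as popular.
toy$group <- ifelse(toy$track_popularity >= 90, "popular", "other")

# Median of each attribute for popular vs. other tracks.
medians_by_group <- aggregate(cbind(danceability, duration_ms) ~ group,
                              data = toy, FUN = median)
```

If the popular group shows a higher median danceability and a lower median duration_ms on the real data, that would support the observations listed above.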

Track Attribute Analysis

A correlation plot was used in order to analyze the correlation between all the track attributes.

# Keep only the numeric track attributes of interest
corr_plot_data <- spotify_clean %>% 
  dplyr::select(track_popularity, danceability, energy, loudness, speechiness,acousticness,instrumentalness, liveness, valence, tempo, duration_ms)

# Plot pairwise correlations for the selected attributes
ggcorr(corr_plot_data, hjust = .85, size = 3, nbreaks=10, palette='YlGnBu', label=TRUE, label_size= 3.5, label_color='white', layout.exp=2)

The following observations were made from the correlation plot:

Energy and loudness have a high positive correlation
Danceability and valence have a positive correlation, although it is not strong
Acousticness and energy have a high negative correlation
Acousticness and loudness have a negative correlation, although it is not strong
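The coefficients behind a correlation plot can be inspected as plain numbers with cor(). The sketch below uses made-up values constructed to mimic the pattern described above, so the resulting coefficients say nothing about the real data.

```r
# Made-up attribute values; loudness is built to rise with energy and
# acousticness to fall with it, mimicking the pattern described above.
toy <- data.frame(
  energy       = c(0.20, 0.35, 0.50, 0.65, 0.80, 0.95),
  loudness     = c(-14.0, -11.5, -9.0, -7.5, -5.0, -3.5),
  acousticness = c(0.90, 0.70, 0.55, 0.30, 0.20, 0.05)
)

# Pearson correlation matrix; "pairwise.complete.obs" skips NA pairs.
corr_matrix <- round(cor(toy, use = "pairwise.complete.obs"), 2)

# Individual coefficients can be read by row and column name.
energy_loudness     <- corr_matrix["energy", "loudness"]
energy_acousticness <- corr_matrix["energy", "acousticness"]
```

Reporting the coefficients alongside the plot would make terms such as "high" and "not strong" concrete.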

Genre Analysis

Top Artists per Genre

The following tree map was used to find the top artist in each genre.

top__artists_per_genre <- spotify_clean %>% group_by(playlist_genre,track_artist) %>% summarise(n = n()) %>% top_n(5, n)

top_artists_per_genre_treemap <- treemap(top__artists_per_genre, index = c("playlist_genre", "track_artist"), vSize = "n",palette = "Blues",  title = "Top 5 Track Artists for each Playlist Genre")

The following observations were made from the tree map:

The top artist in the EDM genre is Martin Garrix
The top artist in the Latin genre is Don Omar
The top artist in the Rap genre is Logic
The top artist in the Rock genre is Queen
The top artist in the Pop genre is The Chainsmokers
The top artist in the R&B genre is Bobby Brown
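The single top artist per genre can also be pulled out as a small table instead of being read off the tree map. This is a minimal sketch in base R on made-up rows, so it runs without the packages loaded earlier; the names toy, counts and top_artist are illustrative.

```r
# Made-up rows standing in for spotify_clean.
toy <- data.frame(
  playlist_genre = c("edm", "edm", "edm", "rock", "rock", "rock", "rock"),
  track_artist   = c("A", "A", "B", "C", "D", "D", "D"),
  stringsAsFactors = FALSE
)

# Count rows per genre/artist pair and drop combinations that never occur.
counts <- as.data.frame(table(toy$playlist_genre, toy$track_artist),
                        stringsAsFactors = FALSE)
names(counts) <- c("playlist_genre", "track_artist", "n")
counts <- counts[counts$n > 0, ]

# Keep the artist(s) with the highest count within each genre.
is_top <- counts$n == ave(counts$n, counts$playlist_genre, FUN = max)
top_artist <- counts[is_top, c("playlist_genre", "track_artist", "n")]
top_artist <- top_artist[order(top_artist$playlist_genre), ]
```

Ties are kept, so a genre with two equally frequent artists would appear twice in top_artist.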

Track Attributes by Genre

The following box plots, overlaid with jittered points, were used to analyze the relationship between different genres and track attributes.

ggplot(spotify_clean, aes(playlist_genre, acousticness)) +
  geom_boxplot() +
  geom_jitter(width = .25, alpha = .5,color='darkgray')

EDM and Rock tracks are low in acousticness.

ggplot(spotify_clean, aes(playlist_genre, duration_ms)) +
  geom_boxplot() +
  geom_jitter(width = .25, alpha = .5,color='darkgray')

duration_ms is neither notably low nor notably high for any genre.

ggplot(spotify_clean, aes(playlist_genre, energy)) +
  geom_boxplot() +
  geom_jitter(width = .25, alpha = .5,color='darkgray')

EDM, Latin, Pop, Rap and Rock tracks are high in energy and loudness, while they are low in acousticness, liveness and speechiness.

ggplot(spotify_clean, aes(playlist_genre, liveness)) +
  geom_boxplot() +
  geom_jitter(width = .25, alpha = .5,color='darkgray')

EDM, Rap and Rock tracks are low in liveness.

ggplot(spotify_clean, aes(playlist_genre, instrumentalness)) +
  geom_boxplot() +
  geom_jitter(width = .25, alpha = .5,color='darkgray')

Rap tracks are low in instrumentalness.

ggplot(spotify_clean, aes(playlist_genre, loudness)) +
  geom_boxplot() +
  geom_jitter(width = .25, alpha = .5,color='darkgray')

EDM, Latin, Pop, R&B, Rap and Rock tracks are high in loudness.

ggplot(spotify_clean, aes(playlist_genre, speechiness)) +
  geom_boxplot() +
  geom_jitter(width = .25, alpha = .5,color='darkgray')

EDM, Rap and Rock tracks are low in speechiness.

ggplot(spotify_clean, aes(playlist_genre, tempo)) +
  geom_boxplot() +
  geom_jitter(width = .25, alpha = .5,color='darkgray')

Tempo is neither notably low nor notably high for any genre.

ggplot(spotify_clean, aes(playlist_genre, valence)) +
  geom_boxplot() +
  geom_jitter(width = .25, alpha = .5,color='darkgray')

EDM tracks are low in valence, while Latin tracks are high in valence.

The following observations were made from this analysis:

EDM tracks are high in energy and loudness, while they are low in speechiness, liveness, acousticness and valence
Latin tracks are high in danceability, energy, loudness and valence
Pop tracks are high in danceability, energy and loudness
R&B tracks are high in danceability and loudness
Rap tracks are high in danceability, energy and loudness, while they are low in acousticness, instrumentalness, liveness and speechiness
Rock tracks are high in energy and loudness, while they are low in acousticness, liveness and speechiness
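Statements such as "high in energy" could be backed by a per-genre table of medians. The following is a minimal sketch on made-up rows in place of spotify_clean, with only two attributes for brevity.

```r
# Made-up rows standing in for spotify_clean.
toy <- data.frame(
  playlist_genre = c("edm", "edm", "edm", "rock", "rock", "rock"),
  energy         = c(0.90, 0.80, 0.85, 0.70, 0.75, 0.60),
  acousticness   = c(0.02, 0.05, 0.03, 0.10, 0.30, 0.20)
)

# Median of each listed attribute within each genre.
genre_medians <- aggregate(cbind(energy, acousticness) ~ playlist_genre,
                           data = toy, FUN = median)
```

Adding the remaining attributes inside cbind() would extend the table to every variable plotted above.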

Summary

The goal of this analysis was to find factors that affect popularity on Spotify in order to provide insight to music artists. This will allow artists to create tracks that can be more popular, which will, in turn, bring them more revenue.

The analysis was done by first finding the most popular artist, track and genre, and the most popular artists in each genre. This allows artists to get a general idea of what is popular on Spotify. We also analyzed frequent track attributes in popular songs. Then, a correlation plot was used to find correlations between all the track attributes. Finally, we analyzed the genres by looking at the relationship between them and track attributes.

The following insights were derived from our analysis:

EDM is the most popular genre on Spotify
Most of the popular songs on Spotify are high in danceability, are louder and have a shorter duration
There is a high correlation between the energy and loudness track attributes
The track attributes energy and loudness are high in the most popular genre, EDM

One limitation of this data set was the type of data available to us. It would have helped to have data on the number of streams per track to further analyze its popularity. We also only had 32828 observations after cleaning, whereas Spotify hosts more than 50 million tracks. More observations would allow for lower variance of the data and a more accurate analysis.

This analysis can be improved by fitting a linear regression model to predict how track attributes affect popularity in the future.
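As a rough illustration of what such a model could look like, the sketch below fits lm() to simulated data. Everything in it is an assumption made for the example: the data are generated, the true slopes (25 and 1.5) are chosen arbitrarily, and only two predictors are used. It shows the mechanics of the fit, not a result about Spotify.

```r
# Simulated data only: popularity is generated from danceability and
# loudness plus noise, so the true slopes are known in advance.
set.seed(42)
n <- 200
sim <- data.frame(
  danceability = runif(n, 0, 1),
  loudness     = runif(n, -20, 0)
)
sim$track_popularity <- 30 + 25 * sim$danceability + 1.5 * sim$loudness +
  rnorm(n, sd = 5)

# Ordinary least squares fit of popularity on the two attributes.
fit <- lm(track_popularity ~ danceability + loudness, data = sim)

# Estimated slopes and share of variance explained.
slopes    <- coef(fit)[c("danceability", "loudness")]
r_squared <- summary(fit)$r.squared
```

With the real data, spotify_clean would replace sim, and the remaining track attributes could be added to the right-hand side of the formula.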

In conclusion, if music artists are looking to increase their popularity on Spotify, they can do so by creating music in the EDM genre, or tracks that are danceable, loud and shorter in duration, and low in acousticness, liveness, instrumentalness and speechiness.