spotify_breakout_artists.knit

Hello,

It was so nice meeting the team! The project below uses Spotify data to quantify which metrics can help predict a great collaboration. I believe this is the best use of the data, both because it aligns with your company's mission of artist collaboration and because the alternative, finding the exact combination of metrics that best predicts a popular song or album, would not be fair to the artists on the app. As an artist myself, I understand that exact formulas destroy the creative process. Great artists push boundaries rather than stay within them. An algorithm that prescribes which key, how many beats per minute, and so on an artist should use to make a hit would be unfair to all the great artists on your app.

This is also why I feel I am a great fit for this position: I can empathize with both the artist and the data scientist, as I am both.

I hope you enjoy the project, and that the artists on your platform gain knowledge from it. It is in a reader-friendly format and should be understandable to any reader.

Thank you for your time, and if you have any questions, please let me know.


Load Libraries

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.5     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.0.2     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

#rankings <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-04-14/rankings.csv")

library(spotifyr)
library(devtools)

## Loading required package: usethis

library(knitr)

Access the Spotify API by obtaining a client ID and client secret from the Spotify developer website, then request an access token.

# Replace the placeholders with your own client ID and client secret.
Sys.setenv(SPOTIFY_CLIENT_ID = '<your-client-id>')
Sys.setenv(SPOTIFY_CLIENT_SECRET = '<your-client-secret>')

access_token <- get_spotify_access_token()
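
The credentials can also be kept out of the script entirely, for example in `~/.Renviron`, since `get_spotify_access_token()` reads `SPOTIFY_CLIENT_ID` and `SPOTIFY_CLIENT_SECRET` from the environment by default. A minimal sketch of a guard that fails early when either is missing; `check_spotify_env()` is a hypothetical helper, not part of spotifyr:

```r
# Hypothetical helper (not part of spotifyr): stop with a clear message
# if either credential is missing from the environment.
check_spotify_env <- function(vars = c("SPOTIFY_CLIENT_ID", "SPOTIFY_CLIENT_SECRET")) {
  values <- Sys.getenv(vars)            # unset variables come back as ""
  missing_vars <- vars[values == ""]
  if (length(missing_vars) > 0) {
    stop("Missing environment variables: ", paste(missing_vars, collapse = ", "))
  }
  invisible(TRUE)
}
```

Calling `check_spotify_env()` just before requesting the token would surface a missing credential before any request is made.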

Extract the top hits for each of the past five years (2016-2020) and combine them into one table. The data is still raw at this stage, so the table may be a bit hard to read.

top_hits <- function(playlistid){
  get_playlist_audio_features('Spotify',playlistid)
}

Twenty_twenty <- top_hits('37i9dQZF1DX7Jl5KP2eZaS') #top hits in 2020

Twenty_nineteen <- top_hits('37i9dQZF1DWVRSukIED0e9') #top hits in 2019

Twenty_eighteen <- top_hits('37i9dQZF1DXe2bobNYDtW8') #top hits in 2018

Twenty_seventeen <- top_hits('37i9dQZF1DWTE7dVUebpUW') #top hits in 2017

Twenty_sixteen <- top_hits('37i9dQZF1DX8XZ6AUo9R4R') #top hits in 2016

Top_hits_16_20 <- rbind(Twenty_twenty, Twenty_nineteen, Twenty_eighteen, Twenty_seventeen, Twenty_sixteen) #top hits from 2016-2020

Top_hits_16_20

## # A tibble: 480 x 61
##    playlist_id  playlist_name  playlist_img    playlist_owner_~ playlist_owner_~
##    <chr>        <chr>          <chr>           <chr>            <chr>           
##  1 37i9dQZF1DX~ Top Tracks of~ https://i.scdn~ Spotify          spotify         
##  2 37i9dQZF1DX~ Top Tracks of~ https://i.scdn~ Spotify          spotify         
##  3 37i9dQZF1DX~ Top Tracks of~ https://i.scdn~ Spotify          spotify         
##  4 37i9dQZF1DX~ Top Tracks of~ https://i.scdn~ Spotify          spotify         
##  5 37i9dQZF1DX~ Top Tracks of~ https://i.scdn~ Spotify          spotify         
##  6 37i9dQZF1DX~ Top Tracks of~ https://i.scdn~ Spotify          spotify         
##  7 37i9dQZF1DX~ Top Tracks of~ https://i.scdn~ Spotify          spotify         
##  8 37i9dQZF1DX~ Top Tracks of~ https://i.scdn~ Spotify          spotify         
##  9 37i9dQZF1DX~ Top Tracks of~ https://i.scdn~ Spotify          spotify         
## 10 37i9dQZF1DX~ Top Tracks of~ https://i.scdn~ Spotify          spotify         
## # ... with 470 more rows, and 56 more variables: danceability <dbl>,
## #   energy <dbl>, key <int>, loudness <dbl>, mode <int>, speechiness <dbl>,
## #   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
## #   tempo <dbl>, track.id <chr>, analysis_url <chr>, time_signature <int>,
## #   added_at <chr>, is_local <lgl>, primary_color <lgl>, added_by.href <chr>,
## #   added_by.id <chr>, added_by.type <chr>, added_by.uri <chr>,
## #   added_by.external_urls.spotify <chr>, track.artists <list>, ...
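
The combined table does not record which year each row came from, other than through the playlist name. A minimal sketch of one way to carry a year label through the combine step, in base R; `combine_playlists()` and `fake_fetch()` are hypothetical names, the playlist IDs are placeholders, and in the real analysis `top_hits` would be passed as `fetch`:

```r
# Hypothetical helper: fetch each playlist, tag its rows with the year
# taken from the vector names, and stack the pieces into one table.
combine_playlists <- function(ids, fetch) {
  pieces <- lapply(names(ids), function(year) {
    tracks <- fetch(ids[[year]])
    tracks$year <- as.integer(year)
    tracks
  })
  do.call(rbind, pieces)
}

# Stand-in for top_hits() so the sketch runs without API access.
fake_fetch <- function(playlistid) {
  data.frame(playlist_id = playlistid, track.name = c("track a", "track b"))
}

playlist_ids <- c("2020" = "placeholder-id-2020", "2019" = "placeholder-id-2019")
combined <- combine_playlists(playlist_ids, fake_fetch)
```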

Narrow the 2016-2020 table to the collaborations and keep only the measurable audio features, such as “danceability” and “energy”. Also count the artists on each track; this will be helpful later on.

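library(dplyr)
track_names <- Top_hits_16_20$track.name #track titles; featured artists appear as "(feat. ...)" or "(with ...)"
# Note: the two patterns are recycled along the rows, so each title is tested against only one of them;
# the single regex "\\(feat|\\(with" would test every title against both.
collabs <- Top_hits_16_20[str_detect(track_names, c("\\(feat","\\(with")),] #find list of collaborations

collabs <- collabs %>% 
  select(c(track.artists, track.name, time_signature, (danceability:tempo))) #keep the artist list, track name, and audio metrics

# Count the artists credited on track x (column 1 is the track.artists list column)
artists_on_each_track_f <- function(x) {
  collabs[[1]][[x]]$name %>% 
    as_tibble() %>% 
    nrow()
}

artists_on_each_track <- 
  map(1:nrow(collabs), artists_on_each_track_f) %>% 
  unlist() #find number of artists on each track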
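
The `str_detect()` call above passes two patterns at once. Assuming the element-wise recycling of stringr 1.4.0 (the version loaded above), title i is paired with only one of the two patterns, so some collaborations can be missed. The sketch below emulates that pairing in base R on made-up titles and compares it with a single alternation regex:

```r
# Made-up track titles, all of which are collaborations.
track_names <- c("A (feat. X)", "B (feat. Y)", "C (with Z)", "D (with W)")

# Emulate element-wise recycling: odd titles are paired with "(feat",
# even titles with "(with".
patterns <- rep_len(c("\\(feat", "\\(with"), length(track_names))
recycled <- mapply(grepl, patterns, track_names, USE.NAMES = FALSE)

# One regex with alternation tests every title against both markers.
either <- grepl("\\(feat|\\(with", track_names)
```

Here `recycled` flags only the first and fourth titles, while `either` flags all four.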

Make a list of every artist who contributed to those songs.

list_of_artist_names_f <- function(x)
{
  collabs[[1]][[x]]$name 
}

list_of_artist_names <- map(1:nrow(collabs), list_of_artist_names_f) %>%
  unlist() 

list_of_artist_names

##   [1] "Justin Bieber"          "Quavo"                  "Surf Mesa"             
##   [4] "Emilee"                 "Lady Gaga"              "Ariana Grande"         
##   [7] "Sam Feldt"              "RANI"                   "Billie Eilish"         
##  [10] "Khalid"                 "Khalid"                 "Disclosure"            
##  [13] "benny blanco"           "Halsey"                 "Khalid"                
##  [16] "Jonas Blue"             "Theresa Rex"            "DJ Snake"              
##  [19] "Selena Gomez"           "Ozuna"                  "Cardi B"               
##  [22] "Ed Sheeran"             "Travis Scott"           "Mark Ronson"           
##  [25] "Miley Cyrus"            "Ed Sheeran"             "Stormzy"               
##  [28] "Bad Bunny"              "Drake"                  "A Boogie Wit da Hoodie"
##  [31] "6ix9ine"                "Young T & Bugsey"       "Aitch"                 
##  [34] "Polo G"                 "Lil Tjay"               "Camila Cabello"        
##  [37] "Young Thug"             "benny blanco"           "Halsey"                
##  [40] "Khalid"                 "Kendrick Lamar"         "SZA"                   
##  [43] "Hailee Steinfeld"       "Alesso"                 "Florida Georgia Line"  
##  [46] "WATT"                   "G-Eazy"                 "Halsey"                
##  [49] "Bebe Rexha"             "Florida Georgia Line"   "Clean Bandit"          
##  [52] "Julia Michaels"         "Maroon 5"               "Cardi B"               
##  [55] "Lil Dicky"              "Chris Brown"            "Rudimental"            
##  [58] "Jess Glynne"            "Macklemore"             "Dan Caplen"            
##  [61] "Maroon 5"               "SZA"                    "Loud Luxury"           
##  [64] "Brando"                 "Clean Bandit"           "Demi Lovato"           
##  [67] "Charlie Puth"           "Kehlani"                "Jason Derulo"          
##  [70] "French Montana"         "The Weeknd"             "Kendrick Lamar"        
##  [73] "Banx & Ranx"            "Ella Eyre"              "Yxng Bane"             
##  [76] "Cashmere Cat"           "Major Lazer"            "Tory Lanez"            
##  [79] "Clean Bandit"           "Sean Paul"              "Anne-Marie"            
##  [82] "Clean Bandit"           "Zara Larsson"           "Maroon 5"              
##  [85] "Future"                 "Kygo"                   "Selena Gomez"          
##  [88] "Machine Gun Kelly"      "Camila Cabello"         "DJ Khaled"             
##  [91] "Justin Bieber"          "Quavo"                  "Chance the Rapper"     
##  [94] "Lil Wayne"              "Avicii"                 "Sandro Cavazza"        
##  [97] "Macklemore"             "Skylar Grey"            "Avicii"                
## [100] "Rita Ora"               "Maroon 5"               "Kendrick Lamar"        
## [103] "KYLE"                   "Lil Yachty"             "OneRepublic"           
## [106] "Seeb"                   "Charlie Puth"           "Selena Gomez"          
## [109] "Snakehips"              "Tinashe"                "Chance the Rapper"     
## [112] "Fifth Harmony"          "Fetty Wap"              "Illy"                  
## [115] "Vera Blue"              "Future"                 "The Weeknd"            
## [118] "Sia"                    "Kendrick Lamar"         "SAYGRACE"              
## [121] "G-Eazy"                 "Kygo"                   "Maty Noyes"

Collapse the artist names on each track into a single comma-separated string, which makes the table much easier to read.

artists_f <- function(x)
{
  collabs[[1]][[x]]$name %>% 
  paste0(collapse = ", ") 
}

artist <- map(1:nrow(collabs), artists_f) %>%
  unlist() 
  
collabs$track.artists <- artist

collabs

## # A tibble: 54 x 14
##    track.artists   track.name  time_signature danceability energy   key loudness
##    <chr>           <chr>                <int>        <dbl>  <dbl> <int>    <dbl>
##  1 Justin Bieber,~ Intentions~              4        0.806  0.546     9    -6.64
##  2 Surf Mesa, Emi~ ily (i lov~              4        0.674  0.774    11    -7.57
##  3 Lady Gaga, Ari~ Rain On Me~              4        0.672  0.855     9    -3.76
##  4 Sam Feldt, RANI Post Malon~              4        0.59   0.642     7    -3.87
##  5 Billie Eilish,~ lovely (wi~              4        0.351  0.296     4   -10.1 
##  6 Khalid, Disclo~ Talk (feat~              4        0.9    0.4       0    -8.57
##  7 benny blanco, ~ Eastside (~              4        0.632  0.686     6    -7.66
##  8 Jonas Blue, Th~ What I Lik~              4        0.46   0.8       1    -3.58
##  9 DJ Snake, Sele~ Taki Taki ~              4        0.842  0.801     8    -4.17
## 10 Ed Sheeran, Tr~ Antisocial~              4        0.716  0.823     5    -5.31
## # ... with 44 more rows, and 7 more variables: mode <int>, speechiness <dbl>,
## #   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
## #   tempo <dbl>

Now, using the metrics noted previously (danceability, energy, etc.), find each artist's typical audio features across their discography: the median of each metric over the albums and singles released since 2016.

artist_audio_features_averages_f <- function(x)
{
  get_artist_audio_features(x, include_groups = c("album", "single")) %>% 
  filter(album_release_year > 2015) %>% #keep releases from 2016 onward
  select(time_signature, (danceability:acousticness), (liveness:tempo)) %>% #all metrics except instrumentalness
  map(median) %>% #median of each metric
  unlist() %>% 
  as_tibble_row() 
}
# Note: this step makes one request per artist name and is rate limited (see the 429 responses below),
# so preload the results from Excel beforehand where possible.
artist_audio_features_averages <- 
  map(list_of_artist_names, artist_audio_features_averages_f)

## Request failed [429]. Retrying in 1 seconds...

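# Stack the per-artist medians into one table and attach the artist names
artist_audio_features_averages_2 <- cbind(list_of_artist_names,
  bind_rows(artist_audio_features_averages, .id = "column_label")) 

# Drop the column_label index (column 2) and mode (column 8)
artist_audio_features_averages_2 <- artist_audio_features_averages_2[,c(-2,-8)]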
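
The 429 responses above mean the API is rate limiting the requests, and the artist list contains repeated names, so the same artist is fetched more than once. A minimal sketch of caching each result on disk so that a name is requested only once; `cached_fetch()` and `fake_fetch()` are hypothetical, and in the real analysis `fetch` would be `artist_audio_features_averages_f`:

```r
# Hypothetical helper: return the cached result for `key` if one exists,
# otherwise pause, call `fetch`, and save the result for next time.
# make.names() is used only to build a file-safe name; distinct keys
# could in principle collide.
cached_fetch <- function(key, fetch, cache_dir, pause = 0) {
  path <- file.path(cache_dir, paste0(make.names(key), ".rds"))
  if (file.exists(path)) {
    return(readRDS(path))
  }
  Sys.sleep(pause)
  result <- fetch(key)
  saveRDS(result, path)
  result
}

# Stand-in for the API call, counting how often it is actually invoked.
calls <- 0
fake_fetch <- function(name) {
  calls <<- calls + 1
  data.frame(artist = name, tempo = 120)
}

cache_dir <- tempfile("artist_cache_")
dir.create(cache_dir)
first <- cached_fetch("Khalid", fake_fetch, cache_dir)
second <- cached_fetch("Khalid", fake_fetch, cache_dir)  # served from disk
```

A non-zero `pause` spaces out the requests that do reach the API.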

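For each collaboration, find the difference in each metric between the last two artists credited on the track.

# Row index of the second-to-last artist on each track; the last artist sits in the next row
cumulative_artists_track <- cumsum(artists_on_each_track) - 1

# Difference between those two artists in each metric (columns 3:11, danceability through tempo)
artist_audio_averages_collabs_f <- function(x) {
  artist_audio_features_averages_2[cumulative_artists_track[x], 3:11] -
    artist_audio_features_averages_2[cumulative_artists_track[x] + 1, 3:11]
}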

artist_audio_average_difference_collabs <- map(1:nrow(collabs), artist_audio_averages_collabs_f) 

artist_audio_average_difference_collabs_2 <- 
  cbind(collabs %>% 
          select(track.artists, track.name),  (bind_rows(artist_audio_average_difference_collabs, .id = "column_label")))

artist_audio_average_difference_collabs_2 <- artist_audio_average_difference_collabs_2[,-3]

artist_diff_c <- 
  artist_audio_average_difference_collabs_2[-27,]
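
The `cumsum(...) - 1` indexing above can be checked on a toy example. The sketch below uses made-up artist labels and tempo values for three tracks with 2, 3, and 2 artists, and shows which rows end up paired:

```r
# Made-up data: one row per credited artist, in track order.
artists_per_track <- c(2, 3, 2)
metrics <- data.frame(
  artist = c("a1", "a2", "b1", "b2", "b3", "c1", "c2"),
  tempo  = c(100, 110, 90, 120, 125, 80, 95)
)

# cumsum() gives the row of each track's last artist; subtracting 1
# gives the second-to-last artist.
second_to_last <- cumsum(artists_per_track) - 1
last <- second_to_last + 1

paired <- data.frame(
  first      = metrics$artist[second_to_last],
  second     = metrics$artist[last],
  tempo_diff = metrics$tempo[second_to_last] - metrics$tempo[last]
)
```

On the three-artist track the pair is `b2` and `b3`, so `b1` does not enter the comparison.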

Find mean, median and standard deviation for each difference.

artist_diff_c <- artist_diff_c[-25,]

mean_artists <- artist_diff_c %>% 
  summarise_if(is.numeric, mean)  
#mean_artists <- cbind("track.artist", "track.name", mean_artists)

median_artists <- artist_diff_c %>% 
  summarise_if(is.numeric, median) 
#median_artists <- cbind("track.artist", "track.name", median_artists)

sd_artists <- artist_diff_c %>% 
  summarise_if(is.numeric, sd) 

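artist_diff_c <- bind_rows(artist_diff_c, mean_artists, median_artists, sd_artists)
# Label the three summary rows just appended at the bottom of the table
n_rows <- nrow(artist_diff_c)
artist_diff_c[(n_rows - 2):n_rows, 2] <- c("mean", "median", "sd")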