Spotify

Introduction

What is Spotify?

Spotify is a publicly traded Swedish company founded in 2006. Users can stream music directly from Spotify without owning physical CDs or records, while artists are paid based on the number of streams, a model that revolutionized the music industry. It has become the world’s largest music streaming provider, with more than 70 million licensed songs and 172 million premium subscribers. The company earns its revenue from advertising and multi-tier premium subscriptions. Spotify is available in most of Europe and the Americas and in more than 40 countries in Africa, on devices including computers, smartphones, and tablets.

[Reference] https://en.wikipedia.org/wiki/Spotify

Why are we interested in Spotify Genre Data?

Music helps people relax and focus, creates atmosphere, and breaks awkward silences.

Good music and good coders are inseparable.

According to the third-quarter 2021 earnings release, Spotify’s total Monthly Active Users grew 19% year over year to 381 million, up from 365 million in the previous quarter, with double-digit growth in all regions. What is the secret of Spotify’s business model? How does Spotify curate its playlists to benefit both users and artists? How could we help Spotify grow its premium subscriber base? Which acoustic features do the most popular songs have in common? In this report we dive into the secret recipe of Spotify’s most popular songs.

[Reference] (https://investors.spotify.com/financials/press-release-details/2021/Spotify-Technology-S.A.-Announces-Financial-Results-for-Third-Quarter-2021/default.aspx)

Purpose of the Report

  • Discover general music trends for musicians
  • Assist Spotify staff with music selection for playlists
  • Provide direction for artists’ music creation

Methodology & Analytic Technique

  • Data Visualization
  • Multiple Linear Regression

Required Packages

Explanation of each package

| Package Name | Description |
| --- | --- |
| ggplot2 | ggplot2 implements the grammar of graphics. We use it to visualize data. |
| dplyr | dplyr is a grammar of data manipulation. You can use it to solve the most common data manipulation challenges. |
| tidyverse | tidyverse is a collection of packages for working with tidy data, where each variable is a column, each observation is a row, and each value is a cell. |
| readr | readr is a fast and friendly way to read rectangular data. |
| plotly | plotly creates interactive web-based graphs via the open source JavaScript graphing library plotly.js. |
| reshape2 | reshape2 is for flexibly restructuring and aggregating data. |
| sqldf | sqldf is an R package for running SQL statements on R data frames, optimized for convenience. |

Packages used & Messages and warnings

The following packages are required for this analysis report.

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.0.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr)
library(ggplot2)
library(readr)
library(plotly)
## Warning: package 'plotly' was built under R version 4.1.2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(sqldf)
## Warning: package 'sqldf' was built under R version 4.1.2
## Loading required package: gsubfn
## Warning: package 'gsubfn' was built under R version 4.1.2
## Loading required package: proto
## Warning: package 'proto' was built under R version 4.1.2
## Loading required package: RSQLite
## Warning: package 'RSQLite' was built under R version 4.1.2

Data Preparation

Original Source

The data for this report comes from TidyTuesday.

The Spotify Songs page describes the original Spotify data in greater detail. The data comes from the spotifyr package. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get either your own data or general metadata about songs from Spotify’s API.

Import Original Data

The table contains 32,833 observations and 23 variables: 10 character columns and 13 double columns.

songs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')

Original Data Details

The following table explains the meaning and measurement of each of the 23 variables in the dataset.

| variable | class | description |
| --- | --- | --- |
| track_id | character | Unique song ID. Every track_name matches one song ID. |
| track_name | character | Song name. One song can be included in different albums or playlists. |
| track_artist | character | Song artist |
| track_popularity | double | Song popularity (0-100), where higher is better |
| track_album_id | character | Unique album ID. Every album has its own ID. |
| track_album_name | character | Song album name |
| track_album_release_date | character | Date when the album was released |
| playlist_name | character | Name of playlist |
| playlist_id | character | Playlist ID |
| playlist_genre | character | Playlist genre |
| playlist_subgenre | character | Playlist sub-genre |
| danceability | double | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
| energy | double | Energy is a measure from 0.0 to 1.0. It represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
| key | double | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. |
| loudness | double | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB. |
| mode | double | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor by 0. |
| speechiness | double | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
| acousticness | double | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
| instrumentalness | double | Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater the likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
| liveness | double | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
| valence | double | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
| tempo | double | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
| duration_ms | double | Duration of song in milliseconds |

Data Observation

We used summary() and head() to get an overview of the Spotify dataset. We may need to convert genre and subgenre to factors. The release date is stored as character and could be converted to a date. We also noticed that duration_ms gives the duration of each song in milliseconds.

head(songs, 10)
## # A tibble: 10 x 23
##    track_id               track_name track_artist track_popularity track_album_id
##    <chr>                  <chr>      <chr>                   <dbl> <chr>         
##  1 6f807x0ima9a1j3VPbc7VN I Don't C~ Ed Sheeran                 66 2oCs0DGTsRO98~
##  2 0r7CVbZTWZgbTCYdfa2P31 Memories ~ Maroon 5                   67 63rPSO264uRjW~
##  3 1z1Hg7Vb0AhHDiEmnDE79l All the T~ Zara Larsson               70 1HoSmj2eLcsrR~
##  4 75FpbthrwQmzHlBJLuGdC7 Call You ~ The Chainsm~               60 1nqYsOef1yKKu~
##  5 1e8PAfcKUYoKkxPhrHqw4x Someone Y~ Lewis Capal~               69 7m7vv9wlQ4i0L~
##  6 7fvUMiyapMsRRxr07cU8Ef Beautiful~ Ed Sheeran                 67 2yiy9cd2QktrN~
##  7 2OAylPUDDfwRGfe0lYqlCQ Never Rea~ Katy Perry                 62 7INHYSeusaFly~
##  8 6b1RNvAcJjQH73eZO4BLAB Post Malo~ Sam Feldt                  69 6703SRPsLkS4b~
##  9 7bF6tCO3gFb8INrEDcjNT5 Tough Lov~ Avicii                     68 7CvAfGvq4RlIw~
## 10 1IXGILkPm0tOCNeq00kCPa If I Can'~ Shawn Mendes               67 4QxzbfSsVryEQ~
## # ... with 18 more variables: track_album_name <chr>,
## #   track_album_release_date <chr>, playlist_name <chr>, playlist_id <chr>,
## #   playlist_genre <chr>, playlist_subgenre <chr>, danceability <dbl>,
## #   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
## #   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
## #   tempo <dbl>, duration_ms <dbl>
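The type conversions mentioned above could look like the following sketch. It runs on a small made-up data frame (`toy` and its values are hypothetical, not the report's data) and assumes that release dates given only as a year should become missing dates rather than be guessed.

```r
# Hypothetical toy data standing in for the real songs table
toy <- data.frame(
  playlist_genre           = c("pop", "rock", "pop"),
  playlist_subgenre        = c("dance pop", "classic rock", "electropop"),
  track_album_release_date = c("2019-06-14", "1975", "2018-11-02"),
  stringsAsFactors         = FALSE
)

# Genre and subgenre are categories, so store them as factors
toy$playlist_genre    <- as.factor(toy$playlist_genre)
toy$playlist_subgenre <- as.factor(toy$playlist_subgenre)

# Strings that do not match the full year-month-day format become NA
toy$release_date <- as.Date(toy$track_album_release_date, format = "%Y-%m-%d")

str(toy)
```

Keeping the original character column next to the parsed one makes it easy to inspect which rows failed to parse.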

Before cleaning the data set, we took a quick look at the data summary. The table contains 32,833 observations and 23 variables; 10 are character columns and the rest are numeric. We then checked for missing values and duplicates in order to decide which cleaning method is most appropriate for keeping the data organized and neat. Detailed explanations of all variables are listed in the Original Data Details section above.

Data Cleaning

Step 1. Checking Missing Data

How many values are missing from the raw dataset, and where are they?
1. The first command shows a total of 15 NA values.
2. The missing values are in the track_name, track_artist, and track_album_name columns (5 in each).

sum(is.na(songs))
## [1] 15
colSums(is.na(songs))
##                 track_id               track_name             track_artist 
##                        0                        5                        5 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        5 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0

Given the size of the dataset (32,833 rows), we decided to drop the rows that contain NA values.

songs <- na.omit(songs)
colSums(is.na(songs))
##                 track_id               track_name             track_artist 
##                        0                        0                        0 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        0 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0
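Because na.omit() drops whole rows, it can help to count the affected rows before removing them. The sketch below shows one way to do that on a made-up data frame (`toy` is hypothetical); it is an illustration, not the check the report ran.

```r
# Hypothetical toy data with one incomplete row
toy <- data.frame(
  track_name   = c("Song A", NA, "Song C"),
  track_artist = c("Artist 1", NA, "Artist 3"),
  energy       = c(0.8, 0.5, 0.3)
)

# Rows that contain at least one NA
incomplete <- toy[!complete.cases(toy), ]
nrow(incomplete)

# na.omit() keeps only the complete rows
toy_clean <- na.omit(toy)
nrow(toy_clean)
```

Comparing the two row counts confirms how many observations the cleaning step removes.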

Step 2. Checking Duplicates

There are no duplicate rows in this dataset. We did not deduplicate on track_id alone, because a song can appear several times in different playlists or albums.

A total of 15 missing values were found in the track_name, track_artist, and track_album_name columns. We deleted the 5 incomplete rows, since a sample this small has little influence on our analysis. Next, we used track_id as the key to check for duplicates. The frequency count shows that 3,165 track IDs occur more than once, which means some songs are included in several playlists. Because our purpose is to analyze playlists of different genres, we decided to keep all observations. We reran is.na() to confirm that no missing values remained. To improve readability, we later convert the duration from milliseconds to minutes by creating a new column and dropping the original one (Step 4).

n_occur <- data.frame(table(songs$track_id))     
count(n_occur[n_occur$Freq > 1, ])
##      n
## 1 3165
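The difference between fully duplicated rows and repeated track IDs can be sketched as follows. The data frame `toy` is made up for illustration, and duplicated() is one possible way to check for identical rows, not necessarily the check used for this report.

```r
# Hypothetical toy data: track "a1" sits in two playlists,
# while the two "b2" rows are identical in every column
toy <- data.frame(
  track_id      = c("a1", "a1", "b2", "b2"),
  playlist_name = c("Pop Hits", "Dance Party", "Rock Classics", "Rock Classics")
)

# Number of rows that exactly repeat an earlier row
n_duplicate_rows <- sum(duplicated(toy))
n_duplicate_rows

# Number of track IDs that appear more than once
id_counts      <- table(toy$track_id)
n_repeated_ids <- sum(id_counts > 1)
n_repeated_ids
```

A repeated track ID is only a true duplicate when every other column matches as well, which is why the two counts can differ.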

Step 3. Adjusting the formatting for album_release_date column

While inspecting the dataset, we found that the formatting of track_album_release_date is not consistent. Some dates are in the format %Y-%m-%d, but others contain only the year or only the year and month. To solve this, we made a copy of the column and separated it into year, month, and day. This adjustment allows us to explore release dates in the next steps.

After the split, 1,855 rows have no release month and 1,886 rows have no release day.

songs$track_album_release_date_copy = songs$track_album_release_date
songs <- songs %>% separate(track_album_release_date_copy, sep = "-", into = c("album_release_year", "album_release_month", "album_release_date"))
colSums(is.na(songs), na.rm = TRUE)
##                 track_id               track_name             track_artist 
##                        0                        0                        0 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        0 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms       album_release_year 
##                        0                        0                        0 
##      album_release_month       album_release_date 
##                     1855                     1886

We examined the structure of the separated values. They are stored as character, so we convert them to numeric.

songs$album_release_year <- as.numeric(songs$album_release_year)
songs$album_release_month <- as.numeric(songs$album_release_month)
songs$album_release_date <- as.numeric(songs$album_release_date)
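As an alternative sketch, tidyr's separate() can pad incomplete dates and convert the pieces in one call through its `fill`, `remove`, and `convert` arguments. The data frame `toy` and the column names `year`, `month`, and `day` are made up for illustration.

```r
library(tidyr)

# Hypothetical release dates: full date, year only, year and month
toy <- data.frame(release = c("2019-06-14", "1975", "2012-05"))

parts <- separate(
  toy, release,
  into    = c("year", "month", "day"),
  sep     = "-",
  fill    = "right",  # pad missing pieces with NA instead of warning
  remove  = FALSE,    # keep the original column
  convert = TRUE      # convert the pieces from character to integer
)
parts
```

With `remove = FALSE` no manual copy of the column is needed, and `convert = TRUE` replaces the separate as.numeric() calls.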

Step 4. Modify the measurement in duration

It is hard to get a sense of how long 216,000 milliseconds is. People do not use milliseconds in everyday life: nobody says "dinner will be ready in 216,000 milliseconds". To make the report readable, we use minutes as the unit for this column.

spotify_complete <- mutate(songs, 
                           duration_m = duration_ms/60000)
spotify_complete <- subset(spotify_complete, select = -c(duration_ms))
names(spotify_complete)
##  [1] "track_id"                 "track_name"              
##  [3] "track_artist"             "track_popularity"        
##  [5] "track_album_id"           "track_album_name"        
##  [7] "track_album_release_date" "playlist_name"           
##  [9] "playlist_id"              "playlist_genre"          
## [11] "playlist_subgenre"        "danceability"            
## [13] "energy"                   "key"                     
## [15] "loudness"                 "mode"                    
## [17] "speechiness"              "acousticness"            
## [19] "instrumentalness"         "liveness"                
## [21] "valence"                  "tempo"                   
## [23] "album_release_year"       "album_release_month"     
## [25] "album_release_date"       "duration_m"

Step 5. Delete unused variables

To reduce run time and keep the analysis tidy, we delete the variables we do not use: track_id, track_album_id, playlist_id, album_release_month, and album_release_date.

songs <- spotify_complete[,-c(1,5,9,24,25)]

Step 6. Check Outliers

We drew box plots of the song features to locate outliers. Except for popularity, key, and valence, we observed numerous points outside the box-plot whiskers. Given the wide value ranges we saw in the summary table, we consider these points acceptable and keep them.

df <- songs %>%
  select(playlist_genre, danceability, speechiness, energy, acousticness, liveness, valence)
# Long format: one row per (genre, feature, value)
songs_reshape <- melt(df, id.vars = "playlist_genre", variable.name = "series")
ggplot(songs_reshape, aes(y = value, color = series)) +
  geom_boxplot() +
  ggtitle("Check Outliers") +
  facet_wrap(~series)
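To make the outlier rule concrete, the sketch below applies the 1.5 × IQR convention that geom_boxplot() uses by default to a made-up vector. The values in `x` are hypothetical and are not taken from the dataset.

```r
# Hypothetical feature values with one extreme point
x <- c(0.10, 0.12, 0.15, 0.18, 0.20, 0.22, 0.25, 0.95)

# Quartiles and interquartile range
q   <- quantile(x, c(0.25, 0.75))
iqr <- q[2] - q[1]

# Points beyond 1.5 * IQR from the box are drawn as outliers
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

outliers <- x[x < lower | x > upper]
outliers
```

Applying the same calculation per feature gives a count of flagged points to accompany the visual check.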

Exploratory Data Analysis

Overview of Acoustic Features

What do the distributions of danceability, speechiness, energy, acousticness, liveness, and valence look like across Spotify’s song pool?

We reshaped the data into long format so that all features can be plotted in one chart for a more direct view. 32,828 tracks are included in this list. Speechiness, acousticness, liveness, and instrumentalness are all skewed to the right, which indicates that most songs have little spoken-word content and low acousticness, liveness, and instrumentalness. But does that hold for tracks with a popularity score of 80 or more?

df1 <- songs %>%
  select(track_name, danceability, speechiness, energy, acousticness, liveness, valence, instrumentalness) %>%
  distinct()
songs_reshape1 <- melt(df1, id.vars = "track_name", variable.name = "series")
a1 <- ggplot(songs_reshape1, aes(value, fill = series, color = series)) +
  ggtitle("Track Features in General") +
  geom_histogram(binwidth = 0.05) +
  facet_wrap(~series)
ggplotly(a1)

We then kept only the tracks with a popularity of 80 or higher and recreated the chart. 500 tracks are included in this list.

df2 <- songs %>%
  filter(track_popularity >= 80) %>%
  select(track_name, danceability, speechiness, energy, acousticness, liveness, valence, instrumentalness) %>%
  distinct()

songs_reshape2 <- melt(df2, id.vars = "track_name", variable.name = "series")
a2 <- ggplot(songs_reshape2, aes(value, fill = series, color = series)) +
  geom_histogram(binwidth = 0.05) +
  ggtitle("Track Features with Popularity over 80") +
  facet_wrap(~series)
ggplotly(a2)

Beyond these acoustic features, why do some tracks become so popular while others go unnoticed? We look into this because popularity translates into income: each stream pays between 0.003 and 0.0084 dollars, so a track streamed 1 million times could earn its musicians roughly 3,000 to 8,400 dollars. Remember the time you listened to one song over and over and just could not get enough of it? That is one way to support your favorite artist, even if you use Spotify for free.
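The payout arithmetic can be written out as a short calculation. The per-stream rates are the two bounds quoted in the paragraph above, and the variable names are chosen here for illustration only.

```r
streams   <- 1e6     # one million streams
rate_low  <- 0.003   # dollars per stream, lower bound quoted above
rate_high <- 0.0084  # dollars per stream, upper bound quoted above

# Income range in dollars for the given number of streams
payout_range <- streams * c(low = rate_low, high = rate_high)
payout_range
```

Changing `streams` gives the corresponding range for any other play count.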

Another secret weapon of Spotify is the playlist. Spotify hires playlist curators to group similar tracks, or tracks of the same genre, for users to discover. You can also submit your suggestions to playlist curators: https://www.jscalco.com/spotify-playlist-curators-you-can-submit-to-for-free/

Once again we looked at the tracks with a popularity of 80 or higher. Most of these popular songs were released after 2010, not long after Spotify was founded, except for songs in the rock genre. Rock fans may still prefer classic rock and roll, or perhaps that is what playlist curators assume.

a3 <- songs %>%
  filter(track_popularity >= 80) %>%
  distinct() %>%
  ggplot(aes(x = factor(album_release_year), y = track_popularity, group = album_release_year, fill = playlist_genre, color = playlist_genre)) +
  geom_jitter(alpha = 0.7) +
  # Release years are treated as discrete categories on the x axis
  scale_x_discrete(name = "year") +
  ggtitle("Album Release Year and Playlist Genre (Popularity >= 80)") +
  facet_wrap(~playlist_genre)

ggplotly(a3)

Top 100 Songs

What is the secret behind the songs with top 100 popularity?

We have analyzed the general features of all songs and of the tracks with a popularity of 80 to 100. We now narrow the analysis to the top 100 songs in the dataset. The purpose is to help playlist curators select songs based on acoustic features, so that their playlists are more likely to be discovered.

The first row of the chart includes index of energy, danceability and speechiness. The second row of the chart includes index of liveness, acousticness and valence.

# Build one scatter plot of an acoustic feature against popularity
# for the 100 most popular tracks
top100_plot <- function(feature) {
  top100 <- songs %>%
    select(track_name, track_artist, track_popularity, track_album_name,
           value = all_of(feature)) %>%
    distinct() %>%
    top_n(100, track_popularity) %>%
    arrange(desc(track_popularity))

  plot_ly(top100, x = ~track_popularity, y = ~value,
          hoverinfo = "text",
          text = ~paste("track name:", track_name, "<br>",
                        "track artist:", track_artist, "<br>",
                        "Album name:", track_album_name, "<br>",
                        feature, value, "<br>",
                        "Track Popularity", track_popularity),
          color = ~value) %>%
    add_markers(size = 2) %>%
    layout(xaxis = list(title = "Track Popularity"),
           yaxis = list(title = paste(feature, "(0-1)")))
}

# Row 1: energy, danceability, speechiness
# Row 2: liveness, acousticness, valence
features <- c("energy", "danceability", "speechiness",
              "liveness", "acousticness", "valence")
plots <- lapply(features, top100_plot)

subplot(plots, nrows = 2, shareX = TRUE, shareY = TRUE) %>%
  layout(title = "Top 100 Acoustic Feature Distribution",
         xaxis = list(title = "Track Popularity"),
         showlegend = FALSE)

Energy Level

Que Tire Pa Lante has the highest energy level among the top 100. We observed that songs generally make the top 100 when their energy level is above 0.5. The most popular song, Dance Monkey, has an energy level of 0.588. One exception is Billie Eilish’s everything i wanted, with a low energy level of 0.225.

Danceability

Danceability follows the same trend as energy in the top 100: most popular songs have a danceability level of 0.400 or above.

Speechiness

The speechiness of the top 100 tracks falls between 0 and 0.4.

Liveness

Liveness for the most popular tracks ranges between 0 and 0.5 and clusters below 0.25.

Acousticness

The 100 most popular tracks range from 0 to 0.6 in acousticness.

Valence

Valence ranges across the whole scale from 0 to 1.

  • If a song has energy > 0.5, danceability > 0.4, speechiness < 0.4, liveness < 0.25, and acousticness < 0.6, it is more likely to become a top 100 song.
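The rule of thumb above can be turned into a simple flag, as in the following sketch. The tracks in `toy` and the column name `fits_profile` are made up for illustration, and the thresholds describe the observed top 100 rather than guarantee popularity.

```r
# Hypothetical tracks, values invented for illustration
toy <- data.frame(
  track_name   = c("Track A", "Track B", "Track C"),
  energy       = c(0.70, 0.30, 0.65),
  danceability = c(0.80, 0.75, 0.35),
  speechiness  = c(0.05, 0.10, 0.08),
  liveness     = c(0.10, 0.12, 0.20),
  acousticness = c(0.20, 0.70, 0.10)
)

# TRUE when a track meets every threshold observed in the top 100
toy$fits_profile <- with(
  toy,
  energy > 0.5 & danceability > 0.4 & speechiness < 0.4 &
    liveness < 0.25 & acousticness < 0.6
)

toy[toy$fits_profile, "track_name"]
```

A curator could use such a flag as a first filter before listening to candidate tracks.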

Popularity for playlist genre and subgenre

First, we look at the frequency distribution of playlist genres.

s5 <- sqldf(
'
select count(distinct track_name) as SongNumber, playlist_genre
from spotify_complete
group by playlist_genre
order by SongNumber desc'
)

# Share of songs per genre
s6 <- s5 %>%
  mutate(percentage = SongNumber / sum(SongNumber))

ggplot(s6, aes(x= reorder(playlist_genre, desc(SongNumber)), y=SongNumber, fill= playlist_genre))+
geom_bar(stat = 'identity') +
ggtitle('Rating Songs Number by Type')+
geom_text(aes(label=round((SongNumber),4)), vjust=1.6, color="black", position = position_dodge(0.9), size=3.5)+
scale_fill_brewer(palette="Paired") +
  xlab("playlist_genre")

Since we know the popularity of each song, we can leverage it to calculate the average popularity of playlist genres and subgenres.

songs %>% 
  select(track_popularity, playlist_name, playlist_genre) %>% 
  group_by(playlist_genre,playlist_name) %>% 
  summarise(ave_pop = mean(track_popularity, na.rm = T)) %>% 
  group_by(playlist_genre) %>% 
  summarise(ave_pop_new = mean(ave_pop, na.rm = T)) %>% 
 ggplot(mapping = aes(x = reorder(playlist_genre, desc(ave_pop_new)), 
                       y = ave_pop_new,
                       fill = playlist_genre)) +
  geom_bar(stat = "identity", width = 0.5) +
  ggtitle("Popularity rank for playlist_genre") +
  ylab("popularity") +
  xlab("playlist genre")

In total there are six playlist genres: pop, latin, rap, rock, r&b, and edm. The bar plot ranks them by popularity. Spotify staff can take this ranking into consideration when choosing recommended playlists for the homepage. Playlists in the pop or latin genres are more likely to gain popularity than playlists in other genres. Therefore, to attract more users, Spotify could feature more pop, latin, or rap playlists on its homepage.
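The genre ranking averages popularity within each playlist first and then across playlists, so every playlist counts equally regardless of its size. The sketch below contrasts this two-stage mean with a pooled mean on made-up data (`toy` is hypothetical).

```r
library(dplyr)

# Hypothetical data: one large playlist and one small playlist in the same genre
toy <- tibble::tibble(
  playlist_genre   = "pop",
  playlist_name    = c("Big List", "Big List", "Big List", "Small List"),
  track_popularity = c(80, 80, 80, 40)
)

# Pooled mean: every track counts equally
pooled <- mean(toy$track_popularity)

# Two-stage mean: every playlist counts equally
two_stage <- toy %>%
  group_by(playlist_genre, playlist_name) %>%
  summarise(ave_pop = mean(track_popularity), .groups = "drop") %>%
  group_by(playlist_genre) %>%
  summarise(ave_pop_new = mean(ave_pop))

pooled                 # 70
two_stage$ave_pop_new  # 60
```

The two-stage mean suits a question about playlists, while the pooled mean would answer a question about individual tracks.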

songs %>% 
  select(track_popularity, playlist_name, playlist_subgenre) %>% 
  group_by(playlist_subgenre,playlist_name) %>% 
  summarise(ave_pop = mean(track_popularity, na.rm = T)) %>% 
  group_by(playlist_subgenre) %>% 
  summarise(ave_pop_new = mean(ave_pop, na.rm = T)) %>% 
  arrange(desc(ave_pop_new)) %>% 
  ggplot(mapping = aes(x = reorder(playlist_subgenre, ave_pop_new),
                              y = ave_pop_new,
                              fill = playlist_subgenre)) +
  geom_bar(stat = "identity", width = 0.5) +
  ggtitle("Popularity rank for playlist_subgenre") +
  xlab("playlist subgenre") +
  ylab("popularity") +
  coord_flip()

There are 24 playlist subgenres, and the bar plot shows their popularity ranking. Like the genre ranking, this ranking can be used when selecting playlists for the homepage. Spotify can give priority to the top ten subgenres, and higher-ranked subgenres can be recommended more often to all users.

Multiple linear regression

We use multiple linear regression to help Spotify understand which elements the audience cares about most. Spotify can then use the model to predict the popularity of a new song and decide whether to recommend it. The model can also give artists guidance for creating more popular songs.

Step 1. Variable selection:

There are 13 numeric variables that describe the characteristics of a song, and we select all of them as candidate predictors. They are danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, album_release_year, and duration_m.

# Response first, then the 13 numeric predictors (columns 9 to 21 of songs)
songs_new <- songs %>%
  select(track_popularity, danceability:duration_m)

Step 2. Model building

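We use forward selection, backward elimination, and stepwise selection to build three models. Each call to step() sets k = log(n), so terms are added or dropped according to BIC rather than the default AIC.

# Linear regression with all predictor variables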
fit_max_1 <- lm(track_popularity ~ ., data = songs_new)

# Backward elimination
bs_1 <- step(fit_max_1, direction = "backward", trace = 0, k = log(nrow(songs_new)))

# Linear regression with only intercept
fit_min <- lm(track_popularity ~ 1, data = songs_new)

# Forward selection
fs_1 <- step(fit_min, direction = "forward",
             scope = list(lower = fit_min, upper = fit_max_1),
             trace = 0, k = log(nrow(songs_new)))

# Stepwise selection
ss_1 <- step(bs_1, direction = "both", 
             scope = list(lower = fit_min, upper = fit_max_1),
             trace = 0, k = log(nrow(songs_new)))
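The effect of the penalty `k` in step() can be seen in a small sketch. It uses R's built-in mtcars data purely as a stand-in, so the selected terms say nothing about the Spotify model.

```r
# Built-in mtcars data, used here only as a stand-in
full <- lm(mpg ~ ., data = mtcars)
n    <- nrow(mtcars)

# k = log(n) gives the BIC penalty; the default k = 2 gives AIC
bic_backward <- step(full, direction = "backward", trace = 0, k = log(n))
aic_backward <- step(full, direction = "backward", trace = 0)

# Compare how many coefficients each penalty keeps
length(coef(bic_backward))
length(coef(aic_backward))
```

Because log(n) exceeds 2 for any sample larger than seven observations, the BIC penalty tends to favor smaller models than AIC.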

Step 3. Model selection

We use AIC, BIC, $R^2_{adj}$, RMSE, and PRESS as our selection criteria.
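For reference, the two less familiar criteria can be written with their standard textbook definitions. The symbols are notation introduced here, not taken from the report: $n$ is the number of observations, $p$ the number of predictors, $e_i$ the ordinary residual, and $h_{ii}$ the leverage of observation $i$.

```latex
R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1},
\qquad
\mathrm{PRESS} = \sum_{i=1}^{n} \left( \frac{e_i}{1 - h_{ii}} \right)^2
```

The term $e_i / (1 - h_{ii})$ is the leave-one-out prediction error, which is what rstandard(x, type = "predictive") returns for a linear model.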

# Function for PRESS
PRESS <- function(object, ...) {
  if(!missing(...)) {
    res <- sapply(list(object, ...), FUN = function(x) {
      sum(rstandard(x, type = "predictive") ^ 2)
    })
    names(res) <- as.character(match.call()[-1L])
    res
  } else {
    sum(rstandard(object, type = "predictive") ^ 2)
  }
}

# Function to compute various model metrics
modelMetrics <- function(object, ...) {
  if(!missing(...)) {
    res <- sapply(list(object, ...), FUN = function(x) {
      c("AIC" = AIC(x), "BIC" = BIC(x), 
        "adjR2" = summary(x)$adj.r.squared,
        "RMSE"  = sigma(x), "PRESS" = PRESS(x), 
        "nterms" = length(coef(x)))
    })
    colnames(res) <- as.character(match.call()[-1L])
    res
  } else {
    c("AIC" = AIC(object), "BIC" = BIC(object), 
      "adjR2" = summary(object)$adj.r.squared, 
      "RMSE"  = sigma(object), "PRESS" = PRESS(object),
      "nterms" = length(coef(object)))
  }
}

# Compare models
result <- modelMetrics(bs_1, fs_1, ss_1)
round(result, digits = 3)
##                bs_1         fs_1         ss_1
## AIC      302020.856   302020.856   302020.856
## BIC      302121.644   302121.644   302121.644
## adjR2         0.072        0.072        0.072
## RMSE         24.069       24.069       24.069
## PRESS  19023634.232 19023634.232 19023634.232
## nterms       11.000       11.000       11.000

All three methods arrive at the same best-fit model. Our final estimated model is
track_popularity = 78.49 + 4.91 * danceability - 29.48 * energy + 1.52 * loudness - 7.41 * speechiness + 3.24 * acousticness - 11.99 * instrumentalness - 4.32 * liveness + 2.83 * valence + 0.02 * tempo - 2.75 * duration_m
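As a worked example, the fitted equation can be applied to a single track. The coefficients below are the rounded values from the equation above. The helper `predict_popularity` and the example track are hypothetical and only illustrate the arithmetic.

```r
# Rounded coefficients copied from the fitted equation above
coefs <- c(
  intercept = 78.49, danceability = 4.91, energy = -29.48, loudness = 1.52,
  speechiness = -7.41, acousticness = 3.24, instrumentalness = -11.99,
  liveness = -4.32, valence = 2.83, tempo = 0.02, duration_m = -2.75
)

# Predicted popularity for one track given as a named numeric vector
predict_popularity <- function(track, coefs) {
  predictors <- setdiff(names(coefs), "intercept")
  unname(coefs["intercept"] + sum(coefs[predictors] * track[predictors]))
}

# Hypothetical track, values invented for illustration
track <- c(danceability = 0.7, energy = 0.6, loudness = -6, speechiness = 0.05,
           acousticness = 0.1, instrumentalness = 0, liveness = 0.1,
           valence = 0.5, tempo = 120, duration_m = 3.5)

predicted <- predict_popularity(track, coefs)
predicted
```

Because the model's adjusted R² is only 0.072, such a prediction is a rough indication rather than a precise estimate.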

Summary

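In the final model, energy, speechiness, instrumentalness, liveness, and duration_m are negatively related to popularity, while loudness, acousticness, valence, tempo, and danceability are positively related. The adjusted R² of 0.072 means the model explains only about 7% of the variance in popularity, so these relationships should be read as weak tendencies.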

Limitation

We acknowledge that this dataset could be explored further. For example, it could be combined with the lyrics of the top 100 songs, and text analysis could then reveal more insight and context about what makes songs popular.

Tips for spotify staff

Among the related variables, energy has the largest coefficient in absolute value (-29.48), which makes it the most influential score in the model. We therefore recommend that Spotify staff pay close attention to energy when selecting a new song.

Danceability, which describes how suitable a track is for dancing, has the largest positive coefficient (4.91).

Rock selections should focus on the classics: spend less on ordinary rock songs and license more classic titles. With new technology, EDM could be a future trend. Although EDM has the lowest average popularity, it is the second-largest genre by number of songs, so we recommend that Spotify look for more good EDM songs.

Tips for Musicians

In general, we firmly believe that every type of music can gain attention and success, and we encourage musicians to create a variety of music. However, popularity directly affects income. We therefore hope our multiple linear regression model can help singers and musicians achieve greater commercial success.

Combining our model and creation facts, we come to the following conclusions.

  • Pop is always a good choice for gaining public attention.
  • Danceability should be considered.
  • Keep track duration close to the average.
  • Control energy: most users choose fresh and pleasant music.
  • Fast, loud, and noisy music with more beat strength and overall regularity is more likely to make the top 100 list. It matters less whether the music is live.
  • Valence represents the “emotion” of a track. The chart shows that listeners’ emotions are all over the place: both happy and sad songs are in high demand.