For my midterm project, I chose to explore the Spotify dataset and examine how the features, genres, and release dates of songs influence their popularity. Below are the different elements of the project, broken out on each tab.
Everyone wants to be famous nowadays. With social media platforms like YouTube, Instagram, and TikTok, anyone can become a content creator and build an audience, and that popularity can be monetized. Content creators are constantly thinking about how to make their accounts, videos, and streaming channels more attractive to viewers. Music is one of the first things they consider when making a video or stream, since music sets the mood of the work. For example, when we watch a vlog on YouTube and the background music sounds light-hearted and cheerful, most of us enjoy the video more and are more likely to hit the like button. In other words, the popularity of the music is positively correlated with the popularity of the content. This is one of many everyday examples that show why music popularity matters.
In this project, I use the Spotify dataset to explore the factors that contribute to the popularity of a song. The dataset contains 32,833 songs across six broad genres: pop, rap, rock, latin, EDM, and R&B. It also contains 12 variables for 12 audio features: acousticness, liveness, speechiness, energy, loudness, danceability, instrumentalness, valence, duration, tempo, key, and mode. Finally, it has a variable called track_popularity, which rates the popularity of each song from 0 to 100. My methodology is to conduct univariate analysis on the useful variables: I use summaries (pivot tables) and graphs such as barplots and boxplots to show how popularity relates to factors such as release date, audio features, and genre. From these methods, we can identify the factors that contribute positively or negatively to the popularity of music, and the factors that are not correlated with it.
This project can guide customers in their choice of music in order to maximize the popularity of their content. YouTubers and streamers could use the results to choose background music that increases the popularity of their videos. Restaurants and shops could also use the results to pick popular tracks to play on site, in order to maximize the satisfaction of their customers. In a nutshell, they can monetize the popularity of the music they choose.
These packages are required to manipulate and visualize the data.
library(dplyr) ## Manipulating data
library(tidyverse) ## Tidying data
library(ggplot2) ## Visualizing data
library(knitr) ## Show original data in good format
library(DT) ## Output data in nice format
library(magrittr) ## Pipe operators
The data comes from Spotify via the spotifyr package. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get either your own data or general metadata around songs from Spotify’s API. The Spotify API provides artist, album, and track data, as well as audio features and genres, for each song.
The Spotify genre data was downloaded beforehand; I downloaded the csv file from a Dropbox link on 3/31/2020.
song <- read.csv("spotify_songs.csv", stringsAsFactors = FALSE)
There are 32,833 observations, one per song, and 23 variables.
# Find the dimensions of the dataset.
dim(song)
## [1] 32833 23
Then I check how many missing values exist in each variable. It turns out 3 of the 23 variables have 5 missing values each.
# See how many NA is in each variable
sapply(song,function(x) sum(is.na(x)))
## track_id track_name track_artist
## 0 5 5
## track_popularity track_album_id track_album_name
## 0 0 5
## track_album_release_date playlist_name playlist_id
## 0 0 0
## playlist_genre playlist_subgenre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
I removed some columns that are not useful for this analysis, including the 3 variables with missing values mentioned above: track_name, track_artist, and track_album_name. Since the purpose of this project is to find the factors that influence popularity, variables such as the name and id of the song, album, or artist do not provide much analytic information, so I removed them.
dropcol <- c("track_id","track_name","track_artist","track_album_id","track_album_name","playlist_name","playlist_id")
song1 <- song %>% select(-all_of(dropcol))
Because I want to explore the correlation between popularity and the year, month, and day of the song's release, I break the release date down into those 3 variables. After checking, however, I found that the variable track_album_release_date contains 1855 values without a month and 1886 values without a day. This will be further addressed in the exploratory data analysis. Since Year, Month, and Date are created for EDA purposes, I do not delete observations with NA in these variables.
# Break down the date to Year, Month, and Date
song1 %>% separate(track_album_release_date,c("Year","Month","Date"),sep = "-") -> song2
sapply(song2,function(x) sum(is.na(x)))
## track_popularity Year Month Date
## 0 0 1855 1886
## playlist_genre playlist_subgenre danceability energy
## 0 0 0 0
## key loudness mode speechiness
## 0 0 0 0
## acousticness instrumentalness liveness valence
## 0 0 0 0
## tempo duration_ms
## 0 0
I converted the categorical variables to factors.
song2 <- song2 %>%
  mutate(playlist_genre = as_factor(playlist_genre),
         playlist_subgenre = as_factor(playlist_subgenre))
Here is a glimpse of all the variables now that they are cleaned with correct data types.
str(song2)
## 'data.frame': 32833 obs. of 18 variables:
## $ track_popularity : int 66 67 70 60 69 67 62 69 68 67 ...
## $ Year : chr "2019" "2019" "2019" "2019" ...
## $ Month : chr "06" "12" "07" "07" ...
## $ Date : chr "14" "13" "05" "19" ...
## $ playlist_genre : Factor w/ 6 levels "pop","rap","rock",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ playlist_subgenre: Factor w/ 24 levels "dance pop","post-teen pop",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
Below is a preview of cleaned data.
song2 %>% head(50) %>% datatable()
Below is a table of the variable name, data types, and a description for each variable.
song2.type <- lapply(song2,class)
song2.name <- colnames(song2)
song2.des <- c('Song Popularity (0-100) where higher is better',
'Year when song is released',
'Month when song is released',
'Date when song is released',
'Playlist genre',
'Playlist subgenre',
'Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.',
'Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.',
'The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.',
'The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.',
'Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.',
'Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.',
'A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.',
'Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.',
'Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.',
'A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).',
'The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.',
'Duration of song in milliseconds')
song2destable <- as_tibble(cbind(song2.name,song2.type,song2.des))
colnames(song2destable) <- c('Variable Name', 'Data Type', 'Variable Description')
kable(song2destable)
| Variable Name | Data Type | Variable Description |
|---|---|---|
| track_popularity | integer | Song Popularity (0-100) where higher is better |
| Year | character | Year when song is released |
| Month | character | Month when song is released |
| Date | character | Date when song is released |
| playlist_genre | factor | Playlist genre |
| playlist_subgenre | factor | Playlist subgenre |
| danceability | numeric | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
| energy | numeric | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
| key | integer | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. |
| loudness | numeric | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB. |
| mode | integer | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. |
| speechiness | numeric | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
| acousticness | numeric | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
| instrumentalness | numeric | Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
| liveness | numeric | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
| valence | numeric | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
| tempo | numeric | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
| duration_ms | integer | Duration of song in milliseconds |
I first look into the year of song release and how it relates to the popularity of the music. The range of Year is from 1957 to 2020. I create a new variable called `Decade`, grouping Year into 4 groups.
range(song2$Year)
## [1] "1957" "2020"
# Year is stored as character after separate(), so convert it before comparing
song2 <- song2 %>%
  mutate(Decade = case_when(
    as.numeric(Year) > 1950 & as.numeric(Year) <= 1970 ~ "1950-1970",
    as.numeric(Year) > 1970 & as.numeric(Year) <= 1990 ~ "1970-1990",
    as.numeric(Year) > 1990 & as.numeric(Year) <= 2010 ~ "1990-2010",
    as.numeric(Year) > 2010 & as.numeric(Year) <= 2020 ~ "2010-2020",
    TRUE ~ "other"))
From the pivot statistics, I found that songs released during 1950-1970 are the most popular on average, while songs released during 1990-2010 are the least popular. However, songs released between 2010 and 2020 make up the large majority of this dataset.
song2 %>%
group_by(Decade) %>%
summarize(count = n(),
average_popularity = mean(track_popularity, na.rm = TRUE),
median_popularity = median(track_popularity, na.rm = TRUE)) %>%
arrange(desc(average_popularity))
## # A tibble: 4 x 4
## Decade count average_popularity median_popularity
## <chr> <int> <dbl> <dbl>
## 1 1950-1970 257 48.0 59
## 2 1970-1990 2361 45.1 51
## 3 2010-2020 23384 44.1 46
## 4 1990-2010 6831 35.8 39
I use boxplots to show how the month of release relates to a song's popularity. From the boxplots, the month of release does not appear to affect the popularity of a song.
boxplot(track_popularity~Month, data = song2)
The remaining variables are audio features, and some of them seem similar to me, such as energy and loudness, or key and mode. I plan to use a summary or visualization of the correlations between the features and drop some redundant ones.
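As a first pass, a correlation matrix over the numeric audio features could flag such redundant pairs. Below is a minimal sketch, assuming the cleaned column names in song2 and an arbitrary cutoff of 0.8 for "highly correlated":
# Correlation matrix of the numeric audio features (sketch only)
feature_cols <- c("danceability", "energy", "loudness", "speechiness",
                  "acousticness", "instrumentalness", "liveness",
                  "valence", "tempo", "duration_ms")
feature_cor <- cor(song2[, feature_cols], use = "complete.obs")
round(feature_cor, 2)
# Pairs with |correlation| above the arbitrary 0.8 cutoff are candidates to drop
which(abs(feature_cor) > 0.8 & feature_cor < 1, arr.ind = TRUE)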
For the remaining features, I plan to use barplots and boxplots to visualize whether each feature is correlated with the popularity of the music. For example, for speechiness, values above 0.66 describe tracks that are probably made entirely of spoken words, values between 0.33 and 0.66 describe tracks that may contain both music and speech, and values below 0.33 most likely represent music and other non-speech-like tracks. In this case I can slice speechiness into 3 groups, store the group labels in a new variable, and then summarize or visualize the data by group to look for a trend or correlation, as in the sketch below.
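A minimal sketch of that slicing, using cut() with the 0.33/0.66 breakpoints described above (the group labels are my own):
# Bin speechiness into the three bands described above and compare popularity
song2 %>%
  mutate(speech_group = cut(speechiness,
                            breaks = c(0, 0.33, 0.66, 1),
                            labels = c("mostly music", "music and speech", "mostly speech"),
                            include.lowest = TRUE)) %>%
  group_by(speech_group) %>%
  summarize(count = n(),
            average_popularity = mean(track_popularity, na.rm = TRUE))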
For now, I do not know how to use ggplot2 well enough to create aesthetically pleasing and effective visualizations, and I am looking forward to learning that in the coming weeks.
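That said, a minimal ggplot2 version of the Month boxplot above might look like the sketch below (untuned, and it simply drops the rows with a missing Month):
# A ggplot2 version of the Month vs. popularity boxplot (untuned sketch)
song2 %>%
  filter(!is.na(Month)) %>%
  ggplot(aes(x = Month, y = track_popularity)) +
  geom_boxplot() +
  labs(x = "Month of release", y = "Track popularity")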
I plan to incorporate supervised learning to answer my question. I plan to run a linear regression to figure out how much each feature contributes to the popularity of music. Meanwhile, I might also run regressions relating the features to music genre, so I can make connections between the features and the genres. The results would then give customers clearer guidance on which genre of music is most popular.
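A provisional sketch of that regression, assuming popularity is modeled directly on the audio features (the exact feature set and formula are placeholders for now):
# Provisional linear model of popularity on the audio features (sketch only)
pop_lm <- lm(track_popularity ~ danceability + energy + loudness + speechiness +
               acousticness + instrumentalness + liveness + valence + tempo +
               duration_ms,
             data = song2)
summary(pop_lm)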