For my midterm project, I chose to explore the Spotify dataset and examine how the features, genres, and release dates of songs influence their popularity. Below are the different elements of the project, broken out on each tab.
Everyone wants to be famous nowadays. With social media platforms like YouTube, Instagram, and TikTok, anyone can become a content creator and build an audience, and that popularity can be monetized. Content creators are constantly thinking about how to make their accounts, videos, and streaming channels more attractive to viewers. Music is one of the first things they consider when making a video or stream, since music sets the mood of the work. For example, when we watch a vlog on YouTube and the background music sounds light-hearted and cheerful, most of us enjoy the video more and are more likely to hit the like button. In other words, the popularity of the music is positively correlated with the popularity of the content. This is one of many everyday examples that show why music popularity matters.
In this project, I use the Spotify dataset to explore the factors that contribute to the popularity of a song. The dataset contains 32,833 songs across six broad genres: pop, rap, rock, latin, EDM, and R&B. It also contains 12 variables for 12 audio features: acousticness, liveness, speechiness, energy, loudness, danceability, instrumentalness, valence, duration, tempo, key, and mode. Finally, it has a variable called track_popularity, which rates the popularity of each song from 0 to 100. My methodology is to conduct univariate analysis on the useful variables: I use summaries (pivot tables) and graphs such as barplots and boxplots to show how popularity relates to factors such as release date, audio features, and genre. From these methods, we can identify the factors that contribute positively or negatively to the popularity of music, and the factors that are not correlated with it.
This project can guide customers in their choice of music in order to maximize the popularity of their content. YouTubers and streamers could use the results to choose background music that increases the popularity of their videos. Restaurants and shops could also use the results to pick popular tracks to play on site, in order to maximize the satisfaction of their customers. In a nutshell, they can monetize the popularity of the music they choose.
These packages are required to manipulate and visualize the data.
library(dplyr) ## Manipulating data
library(tidyverse) ## Tidying data
library(ggplot2) ## Visualizing data
library(knitr) ## Show original data in good format
library(DT) ## Output data in nice format
library(magrittr) ## Pipe operators
The data comes from Spotify via the spotifyr package. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get either your own data or general metadata around songs from Spotify’s API. The Spotify API provides artist, album, and track data, as well as audio features and genres, for each song.
The Spotify genre data was downloaded beforehand; I downloaded the csv file from a Dropbox link on 3/31/2020.
song <- read.csv("spotify_songs.csv", stringsAsFactors = FALSE)
There are 32,833 observations, one per song, and 23 variables.
# Find the dimensions of the dataset.
dim(song)
## [1] 32833 23
Then I check how many missing values exist in each variable. It turns out 3 of the 23 variables have 5 missing values each.
# See how many NA is in each variable
sapply(song,function(x) sum(is.na(x)))
## track_id track_name track_artist
## 0 5 5
## track_popularity track_album_id track_album_name
## 0 0 5
## track_album_release_date playlist_name playlist_id
## 0 0 0
## playlist_genre playlist_subgenre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
I removed some columns that are not useful for this analysis, including the 3 variables with missing values mentioned above: track_name, track_artist, and track_album_name. Since the purpose of this project is to find the factors that influence popularity, variables such as the name and id of the song, album, or artist do not provide much analytic information, so I removed them.
dropcol <- c("track_id","track_name","track_artist","track_album_id","track_album_name","playlist_name","playlist_id")
song1 <- song %>% select(-all_of(dropcol))
Because I want to explore the correlation between popularity and the year, month, and day of the song's release, I break the release date down into those 3 variables. After checking, however, I found that the variable track_album_release_date contains 1855 values without a month and 1886 values without a day. This will be further addressed in the exploratory data analysis. Since Year, Month, and Date are created for EDA purposes, I do not delete observations with NA in these variables.
# Break down the date to Year, Month, and Date
song1 %>% separate(track_album_release_date,c("Year","Month","Date"),sep = "-") -> song2
sapply(song2,function(x) sum(is.na(x)))
## track_popularity Year Month Date
## 0 0 1855 1886
## playlist_genre playlist_subgenre danceability energy
## 0 0 0 0
## key loudness mode speechiness
## 0 0 0 0
## acousticness instrumentalness liveness valence
## 0 0 0 0
## tempo duration_ms
## 0 0
I converted the categorical variables to factors.
song2 <- song2 %>%
  mutate(playlist_genre = as_factor(playlist_genre),
         playlist_subgenre = as_factor(playlist_subgenre))
Here is a glimpse of all the variables now that they are cleaned with correct data types.
str(song2)
## 'data.frame': 32833 obs. of 18 variables:
## $ track_popularity : int 66 67 70 60 69 67 62 69 68 67 ...
## $ Year : chr "2019" "2019" "2019" "2019" ...
## $ Month : chr "06" "12" "07" "07" ...
## $ Date : chr "14" "13" "05" "19" ...
## $ playlist_genre : Factor w/ 6 levels "pop","rap","rock",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ playlist_subgenre: Factor w/ 24 levels "dance pop","post-teen pop",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
Below is a preview of cleaned data.
song2 %>% head(50) %>% datatable()
Below is a table of the variable name, data types, and a description for each variable.
song2.type <- lapply(song2,class)
song2.name <- colnames(song2)
song2.des <- c('Song Popularity (0-100) where higher is better',
'Year when song is released',
'Month when song is released',
'Date when song is released',
'Playlist genre',
'Playlist subgenre',
'Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.',
'Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.',
'The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.',
'The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.',
'Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.',
'Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.',
'A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.',
'Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.',
'Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.',
'A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).',
'The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.',
'Duration of song in milliseconds')
song2destable <- as_tibble(cbind(song2.name,song2.type,song2.des))
colnames(song2destable) <- c('Variable Name', 'Data Type', 'Variable Description')
kable(song2destable)
| Variable Name | Data Type | Variable Description |
|---|---|---|
| track_popularity | integer | Song Popularity (0-100) where higher is better |
| Year | character | Year when song is released |
| Month | character | Month when song is released |
| Date | character | Date when song is released |
| playlist_genre | factor | Playlist genre |
| playlist_subgenre | factor | Playlist subgenre |
| danceability | numeric | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
| energy | numeric | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
| key | integer | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. |
| loudness | numeric | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB. |
| mode | integer | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. |
| speechiness | numeric | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
| acousticness | numeric | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
| instrumentalness | numeric | Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
| liveness | numeric | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
| valence | numeric | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
| tempo | numeric | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
| duration_ms | integer | Duration of song in milliseconds |
I first look into the year of song release and how it relates to the popularity of the music. The range of Year is from 1957 to 2020. I create a new variable called `Decade`, grouping Year into 4 groups.
range(song2$Year)
## [1] "1957" "2020"
# Year is stored as character after separate(), so convert it before comparing
song2 <- song2 %>%
  mutate(Decade = case_when(
    as.numeric(Year) > 1950 & as.numeric(Year) <= 1970 ~ "1950-1970",
    as.numeric(Year) > 1970 & as.numeric(Year) <= 1990 ~ "1970-1990",
    as.numeric(Year) > 1990 & as.numeric(Year) <= 2010 ~ "1990-2010",
    as.numeric(Year) > 2010 & as.numeric(Year) <= 2020 ~ "2010-2020",
    TRUE ~ "other"))
From the pivot statistics, I found that songs released during 1950-1970 are the most popular on average, while songs released during 1990-2010 are the least popular. However, songs released between 2010 and 2020 make up the large majority of this dataset.
song2 %>%
group_by(Decade) %>%
summarize(count = n(),
average_popularity = mean(track_popularity, na.rm = TRUE),
median_popularity = median(track_popularity, na.rm = TRUE)) %>%
arrange(desc(average_popularity))
## # A tibble: 4 x 4
## Decade count average_popularity median_popularity
## <chr> <int> <dbl> <dbl>
## 1 1950-1970 257 48.0 59
## 2 1970-1990 2361 45.1 51
## 3 2010-2020 23384 44.1 46
## 4 1990-2010 6831 35.8 39
I use boxplots to show how the month of release relates to a song's popularity. From the boxplots, the month of release does not appear to affect the popularity of a song.
boxplot(track_popularity~Month, data = song2)
The remaining variables are audio features, and some of them seem similar to me, such as energy and loudness, or key and mode. I plan to use a summary or visualization of the correlations between the features and drop some redundant ones.
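As a first pass, a correlation matrix over the numeric audio features could flag such redundant pairs. Below is a minimal sketch, assuming the cleaned column names in song2 and an arbitrary cutoff of 0.8 for "highly correlated":
# Correlation matrix of the numeric audio features (sketch only)
feature_cols <- c("danceability", "energy", "loudness", "speechiness",
                  "acousticness", "instrumentalness", "liveness",
                  "valence", "tempo", "duration_ms")
feature_cor <- cor(song2[, feature_cols], use = "complete.obs")
round(feature_cor, 2)
# Pairs with |correlation| above the arbitrary 0.8 cutoff are candidates to drop
which(abs(feature_cor) > 0.8 & feature_cor < 1, arr.ind = TRUE)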
For the remaining features, I plan to use barplots and boxplots to visualize whether each feature is correlated with the popularity of the music. For example, for speechiness, values above 0.66 describe tracks that are probably made entirely of spoken words, values between 0.33 and 0.66 describe tracks that may contain both music and speech, and values below 0.33 most likely represent music and other non-speech-like tracks. In this case I can slice speechiness into 3 groups, store the group labels in a new variable, and then summarize or visualize the data by group to look for a trend or correlation, as in the sketch below.
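A minimal sketch of that slicing, using cut() with the 0.33/0.66 breakpoints described above (the group labels are my own):
# Bin speechiness into the three bands described above and compare popularity
song2 %>%
  mutate(speech_group = cut(speechiness,
                            breaks = c(0, 0.33, 0.66, 1),
                            labels = c("mostly music", "music and speech", "mostly speech"),
                            include.lowest = TRUE)) %>%
  group_by(speech_group) %>%
  summarize(count = n(),
            average_popularity = mean(track_popularity, na.rm = TRUE))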
For now, I do not know how to use ggplot2 well enough to create aesthetically pleasing and effective visualizations, and I am looking forward to learning that in the coming weeks.
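That said, a minimal ggplot2 version of the Month boxplot above might look like the sketch below (untuned, and it simply drops the rows with a missing Month):
# A ggplot2 version of the Month vs. popularity boxplot (untuned sketch)
song2 %>%
  filter(!is.na(Month)) %>%
  ggplot(aes(x = Month, y = track_popularity)) +
  geom_boxplot() +
  labs(x = "Month of release", y = "Track popularity")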
I plan to incorporate supervised learning to answer my question. I plan to run a linear regression to figure out how much each feature contributes to the popularity of music. Meanwhile, I might also run regressions relating the features to music genre, so I can make connections between the features and the genres. The results would then give customers clearer guidance on which genre of music is most popular.
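A provisional sketch of that regression, assuming popularity is modeled directly on the audio features (the exact feature set and formula are placeholders for now):
# Provisional linear model of popularity on the audio features (sketch only)
pop_lm <- lm(track_popularity ~ danceability + energy + loudness + speechiness +
               acousticness + instrumentalness + liveness + valence + tempo +
               duration_ms,
             data = song2)
summary(pop_lm)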