In the immortal words of Madonna, “Music makes the people come together.” It crosses cultures, provides entertainment and fun, and is often used to motivate and inspire during public and social events. It is used in music therapy to ease psychological disturbances and medical conditions, sung as lullabies to soothe our children to sleep, and played in the background during meditative endeavors. Friendships are forged over music, and we often hear couples stake claims to a musical piece as “our song” or see groups of teenagers banding together over similar musical leanings. A common expression asserts that “music is the soundtrack of our lives.”
But when it comes to popularity, what is it about a particular song that can promote the broadest appeal? Is it the tempo that makes it popular, or how cheerful or positive its sound is? Does the sound matter at all as long as you can dance to it? The purpose of this project is to explore what makes a song popular, not just by looking at the most popular songs and analyzing their characteristics, but also by comparing them to the least popular songs. For this project, I will use an open-source data set collected from the popular music streaming service Spotify and, working in RStudio, analyze the attributes of its songs by popularity ranking. Through this analysis, I will attempt to answer the question: “What exactly is it about music that makes the most people come together?”
To analyze this data, I will use data.table to create limited table views of character data. I will use tidyverse, which includes packages for cleaning, working with dataframes, and creating plots. I will use stats for analyses and regression, and I will use shiny for adding design elements to visualizations.
library(data.table)
library(tidyverse)
library(shiny)
library(stats)
The Spotify data can be downloaded by clicking here.
Information about the data and variables can be viewed here.
This data was collected from Spotify in January of 2020. Therefore, all release dates for songs occur prior to February of 2020. It was collected for the purpose of sharing with the public to explore, learn, and create. The original set includes 32,833 total records with 23 variables for each song (including song title, album and unique identifier). These variables describe qualities of the song, such as loudness and danceability, and attributes of the release, such as artist and genre. Track popularity is a variable calculated by Spotify that is largely based on how much a particular track is played on their platform.
Once the data is downloaded and imported, it should be assigned to a data set called “songs”. Note that many songs are duplicated within the data set, due to the fact that they may appear on both an album and a single, or at times, multiple albums (such as an original release as well as a Greatest Hits). Since Spotify offers a unique identifier (track_id), this column can be used to remove the duplicated track listings:
songs <- songs[!duplicated(songs$track_id), ]
This leaves a new total of 28,356 records.
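As a minimal sketch of how this de-duplication behaves, the toy data frame below (hypothetical values, not taken from the Spotify set) lists one track on both a single and a compilation. Only the column names track_id and track_album_name come from the data set; everything else is made up for illustration.

```r
# Toy data (hypothetical): track "a1" appears on a single and on a compilation
toy <- data.frame(
  track_id = c("a1", "b2", "a1", "c3"),
  track_album_name = c("Single", "Album X", "Greatest Hits", "Album Y"),
  stringsAsFactors = FALSE
)

# duplicated() flags the second and later occurrences of each track_id,
# so negating it keeps only the first listing of every track
toy_unique <- toy[!duplicated(toy$track_id), ]

nrow(toy)         # 4 rows before
nrow(toy_unique)  # 3 rows after
```

Because the first occurrence is the one that survives, which album a track stays attached to depends on the row order of the data.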
To make the data set easier to view and work with, variables that will definitely not be used can be dropped. I know I will not need many of the columns giving attributes of the release, so I will drop track_album_id, track_album_name, playlist_name, playlist_id, and playlist_subgenre (since I will only look at primary genres for simplicity). Track ID is no longer needed either, since all of the duplicate values were removed, and I will remove key because its values are integer-coded pitch classes rather than quantities, so it would not lend value to a numeric analysis. The remaining columns may not all be used for analyses, but they all offer opportunities for additional analysis and storytelling.
songs <- subset(songs, select = -c(track_id, track_album_id, track_album_name, playlist_name, playlist_id, playlist_subgenre, key))
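An alternative way to drop columns, sketched here on a hypothetical toy data frame, is to filter on column names. Unlike the negative selection above, this form does not fail if one of the listed columns has already been removed, which can help when a script is re-run. The name drop_cols is an assumption introduced for this example.

```r
# Toy data (hypothetical values) with two columns to discard
toy <- data.frame(
  track_id = c("a1", "b2"),
  key = c(5L, 0L),
  tempo = c(120.0, 98.5),
  stringsAsFactors = FALSE
)

# Names of the columns to drop (drop_cols is a name chosen for this sketch)
drop_cols <- c("track_id", "key", "not_a_column")

# Keep every column whose name is not listed; unknown names are ignored
toy_small <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]

names(toy_small)
```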
After removing these columns, the data is in good rough shape for analysis. The final two issues that need to be addressed are missing values and variable types. A look at missing values reveals that there are eight spread throughout the entire set.
sum(is.na(songs))
## [1] 8
To determine where the missing data is, it is necessary to view the rows where these values lie.
songs[!complete.cases(songs), ]
## track_name track_artist track_popularity track_album_release_date
## 8152 <NA> <NA> 0 2012-01-05
## 9283 <NA> <NA> 0 2017-12-01
## 9284 <NA> <NA> 0 2017-12-01
## 19569 <NA> <NA> 0 2012-01-05
## playlist_genre danceability energy loudness mode speechiness acousticness
## 8152 rap 0.714 0.821 -7.635 1 0.1760 0.0410
## 9283 rap 0.678 0.659 -5.364 0 0.3190 0.0534
## 9284 rap 0.465 0.820 -5.907 0 0.3070 0.0963
## 19569 latin 0.675 0.919 -6.075 0 0.0366 0.0606
## instrumentalness liveness valence tempo duration_ms
## 8152 0.00000 0.1160 0.649 95.999 282707
## 9283 0.00000 0.5530 0.191 146.153 202235
## 9284 0.00000 0.0888 0.505 86.839 206465
## 19569 0.00653 0.1030 0.726 97.017 252773
The eight missing values fall in just four rows, each lacking both a track name and an artist. Since these tracks cannot be identified, the four rows will be removed from the set.
songs <- songs[complete.cases(songs), ]
This leaves a new total of 28,352 records.
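The missing-value check can also be broken down by column and by row. The sketch below uses a small hypothetical data frame, not the Spotify data, to show how the total count of missing cells relates to the number of incomplete rows.

```r
# Toy data (hypothetical): name and artist are missing in the same two rows
toy <- data.frame(
  track_name = c("Song A", NA, "Song C", NA),
  track_artist = c("Artist 1", NA, "Artist 3", NA),
  track_popularity = c(50L, 0L, 70L, 0L),
  stringsAsFactors = FALSE
)

n_missing <- sum(is.na(toy))                # total missing cells
na_per_column <- colSums(is.na(toy))        # missing cells in each column
n_incomplete <- sum(!complete.cases(toy))   # rows with at least one NA

# Keep only the rows with no missing values
toy_clean <- toy[complete.cases(toy), ]
```

Here four missing cells sit in two rows, the same pattern as eight missing cells in four rows in the full data set.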
The next issue, as described above, is variable types. A look at the structure of the data reveals that all of the remaining variables are appropriate types with the exception of track album release dates, which are stored as character variables. Looking closer at these, it can be observed that most values are in the format YYYY-MM-DD with many values stored only as four-digit years. Since all of these records begin with the year, the year can be extracted for all records using only the first four digits. Therefore, this variable will be cleaned by isolating the four-digit year then converting it to a numeric variable.
songs$track_album_release_date <- substr(songs$track_album_release_date, 1, 4)
songs$track_album_release_date <- as.numeric(songs$track_album_release_date)
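To illustrate the two steps on their own, the sketch below applies them to a hypothetical character vector that mixes the two date formats described above. The sanity check with grepl() is an addition for this example, not part of the original cleaning.

```r
# Hypothetical release dates in the two formats described above
dates <- c("2019-06-14", "2012", "2017-12-01")

# Check that every value starts with four digits before extracting them
stopifnot(all(grepl("^[0-9]{4}", dates)))

# Keep the first four characters, then convert to numeric
years <- as.numeric(substr(dates, 1, 4))

years  # 2019 2012 2017
```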
This leaves a cleaned data set, illustrated by a condensed view of the first ten rows:
as_tibble(songs)
## # A tibble: 28,352 x 16
## track_name track_artist track_popularity track_album_rel~ playlist_genre
## <chr> <chr> <int> <dbl> <chr>
## 1 I Don't Care ~ Ed Sheeran 66 2019 pop
## 2 Memories - Di~ Maroon 5 67 2019 pop
## 3 All the Time ~ Zara Larsson 70 2019 pop
## 4 Call You Mine~ The Chainsmo~ 60 2019 pop
## 5 Someone You L~ Lewis Capaldi 69 2019 pop
## 6 Beautiful Peo~ Ed Sheeran 67 2019 pop
## 7 Never Really ~ Katy Perry 62 2019 pop
## 8 Post Malone (~ Sam Feldt 69 2019 pop
## 9 Tough Love - ~ Avicii 68 2019 pop
## 10 If I Can't Ha~ Shawn Mendes 67 2019 pop
## # ... with 28,342 more rows, and 11 more variables: danceability <dbl>,
## # energy <dbl>, loudness <dbl>, mode <int>, speechiness <dbl>,
## # acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
## # tempo <dbl>, duration_ms <int>
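The final set contains 16 variables: track_name, track_artist, track_popularity, track_album_release_date, playlist_genre, danceability, energy, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, and duration_ms. The playlist_genre variable takes six values:
## [1] "pop" "rap" "rock" "latin" "r&b" "edm"
Track length, stored in milliseconds as duration_ms, is distributed as follows:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4000 187741 216933 226575 254975 517810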
This analysis will consist of two parts. In part one, I will analyze popularity using the numeric variables that describe the attributes of the song. I will search for the model of best fit and propose a regression equation for predicting the popularity of a track. In part two, I will evaluate the qualities of the most popular tracks compared to the properties of the least popular tracks. Overall, this analysis will provide a multi-faceted view of the qualities that differentiate the most popular music from the least popular music. Visualizations will include a plot of the regression equation with descriptions of the coefficients, and charted comparisons of high and low categories as well as other numeric variables of interest. Since I am still a student of R, this will require learning the available packages and their capabilities as well as methods that will enhance the overall appearance of visualization outputs.
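As a rough sketch of what the two parts could look like in code, the example below fits a linear model with lm() and then compares a high and a low popularity group. It runs on synthetic data generated for illustration, so its numbers say nothing about real songs. The choice of predictors and the quartile cutoffs for "most" and "least" popular are assumptions, not the method used in the final analysis.

```r
set.seed(42)
n <- 200

# Synthetic predictors (made up; only the column names mirror the data set)
toy <- data.frame(
  danceability = runif(n),
  energy = runif(n),
  loudness = runif(n, min = -20, max = 0)
)

# Synthetic response built from an arbitrary linear rule plus noise
toy$track_popularity <- 30 + 20 * toy$danceability - 5 * toy$energy +
  0.8 * toy$loudness + rnorm(n, sd = 5)

# Part one: fit a multiple linear regression and inspect its coefficients
fit <- lm(track_popularity ~ danceability + energy + loudness, data = toy)
coef(fit)
summary(fit)$adj.r.squared

# Part two: split into low and high groups (quartile cutoffs are an assumption)
cutoffs <- quantile(toy$track_popularity, probs = c(0.25, 0.75))
low <- toy[toy$track_popularity <= cutoffs[1], ]
high <- toy[toy$track_popularity >= cutoffs[2], ]

# Difference in mean attribute values between the two groups
attrs <- c("danceability", "energy", "loudness")
mean_diff <- colMeans(high[, attrs]) - colMeans(low[, attrs])
mean_diff
```

The fitted coefficients give the regression equation, and mean_diff shows which attributes differ between the two ends of the popularity range.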