Introduction

In the immortal words of Madonna, “Music makes the people come together.” It crosses cultures, provides entertainment and fun, and is often used to motivate and inspire during public and social events. It is used in music therapy to ease psychological disturbances and medical conditions, sung as lullabies to soothe our children to sleep, and played in the background during meditative endeavors. Friendships are forged over music, and we often hear couples stake a claim to a musical piece as “our song” or see groups of teenagers banding together over similar musical leanings. A common expression asserts that “music is the soundtrack of our lives.”

But when it comes to popularity, what is it about a particular song that can promote the broadest appeal? Is it the tempo that makes it popular, or how cheerful or positive its sound is? Does the sound matter at all as long as you can dance to it? The purpose of this project is to explore what makes a song popular, not just by looking at the most popular songs and analyzing their characteristics, but also by comparing them to the least popular songs. For this project, I will use an open-source data set collected from the popular music streaming service, Spotify, and analyze the attributes of the songs they offer by popularity ranking in RStudio. Through this analysis, I will attempt to answer the question: “What exactly is it about music that makes the most people come together?”

Packages Required

To analyze this data, I will use data.table to create limited table views of character data. I will use tidyverse, which includes packages for cleaning, working with dataframes, and creating plots. I will use stats for analyses and regression, and I will use shiny for adding design elements to visualizations.

library(data.table)
library(tidyverse)
library(shiny)
library(stats)

Data Preparation

The Spotify data can be downloaded by clicking here.

Information about the data and variables can be viewed here.

This data was collected from Spotify in January of 2020. Therefore, all release dates for songs occur prior to February of 2020. It was collected for the purpose of sharing with the public to explore, learn, and create. The original set includes 32,833 total records with 23 variables for each song (including song title, album and unique identifier). These variables describe qualities of the song, such as loudness and danceability, and attributes of the release, such as artist and genre. Track popularity is a variable calculated by Spotify that is largely based on how much a particular track is played on their platform.

Once the data is downloaded and imported, it should be assigned to a data frame called “songs”. Note that many songs are duplicated within the data set because they may appear on both an album and a single or, at times, on multiple albums (such as an original release as well as a Greatest Hits). Since Spotify offers a unique identifier (track_id), this column can be used to remove the duplicated track listings:

songs <- songs[!duplicated(songs$track_id), ]

This leaves a new total of 28,356 records.
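
To illustrate how this step behaves, here is a minimal sketch on a small made-up data frame. The rows, the `toy` and `toy_unique` names, and the `album` column are hypothetical and are not taken from the Spotify data.

```r
# Toy data frame: hypothetical rows, not from the Spotify set.
toy <- data.frame(
  track_id   = c("a1", "b2", "a1", "c3", "b2"),
  track_name = c("Song A", "Song B", "Song A", "Song C", "Song B"),
  album      = c("Album 1", "Single", "Greatest Hits", "Album 2", "Album 3"),
  stringsAsFactors = FALSE
)

# duplicated() flags every repeat after the first occurrence,
# so negating it keeps the first row seen for each track_id.
toy_unique <- toy[!duplicated(toy$track_id), ]

nrow(toy)         # 5
nrow(toy_unique)  # 3
```

Because only the first occurrence is kept, the surviving row for each track depends on the row order of the data, so the retained album and genre values are those of the first listing.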

To make the data set easier to view and work with, variables that will definitely not be used can be dropped. I know I will not need many of the columns giving attributes of the release, so I will drop track_album_id, track_album_name, playlist_name, playlist_id, and playlist_subgenre (since I will only look at primary genres for simplicity). Track ID is no longer needed either, since all of the duplicate values were removed. I will also remove key, because it is an integer code for musical pitch class rather than a true quantity and would not lend value to a numeric analysis. The remaining columns may not all be used for analyses, but they all offer opportunities for additional analysis and storytelling.

songs <- subset(songs, select = -c(track_id, track_album_id, track_album_name, playlist_name, playlist_id, playlist_subgenre, key))

After removing these columns, the data is in good rough shape for analysis. The final two issues that need to be addressed are missing values and variable types. A look at missing values reveals that there are eight spread throughout the entire set.

sum(is.na(songs))
## [1] 8

To determine where the missing data is, it is necessary to view the rows where these values lie.

songs[!complete.cases(songs), ]
##       track_name track_artist track_popularity track_album_release_date
## 8152        <NA>         <NA>                0               2012-01-05
## 9283        <NA>         <NA>                0               2017-12-01
## 9284        <NA>         <NA>                0               2017-12-01
## 19569       <NA>         <NA>                0               2012-01-05
##       playlist_genre danceability energy loudness mode speechiness acousticness
## 8152             rap        0.714  0.821   -7.635    1      0.1760       0.0410
## 9283             rap        0.678  0.659   -5.364    0      0.3190       0.0534
## 9284             rap        0.465  0.820   -5.907    0      0.3070       0.0963
## 19569          latin        0.675  0.919   -6.075    0      0.0366       0.0606
##       instrumentalness liveness valence   tempo duration_ms
## 8152           0.00000   0.1160   0.649  95.999      282707
## 9283           0.00000   0.5530   0.191 146.153      202235
## 9284           0.00000   0.0888   0.505  86.839      206465
## 19569          0.00653   0.1030   0.726  97.017      252773

All eight missing values fall in just four rows, each of which lacks both a track name and an artist. Since these tracks cannot be identified, they will be removed from the set.

songs <- songs[complete.cases(songs), ]

This leaves a new total of 28,352 records.
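
As a sanity check on this kind of cleanup, the sketch below shows how `sum(is.na())`, `colSums(is.na())`, and `complete.cases()` relate to one another. It uses a made-up data frame; the values and the `toy` names are hypothetical and are not from the Spotify data.

```r
# Toy data frame with hypothetical values (not from the Spotify set).
toy <- data.frame(
  track_name   = c("Song A", NA, "Song C", NA),
  track_artist = c("Artist 1", NA, "Artist 3", NA),
  tempo        = c(120.0, 95.9, 101.5, 97.0),
  stringsAsFactors = FALSE
)

total_missing  <- sum(is.na(toy))           # total NA cells in the frame
missing_by_col <- colSums(is.na(toy))       # NA count for each column
toy_complete   <- toy[complete.cases(toy), ]  # rows with no NA in any column

total_missing   # 4
missing_by_col  # track_name 2, track_artist 2, tempo 0
nrow(toy_complete)  # 2
```

The per-column count makes it easy to see whether missing values are concentrated in a few columns before deciding to drop whole rows.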

The next issue, as described above, is variable types. A look at the structure of the data reveals that all of the remaining variables are appropriate types with the exception of track album release dates, which are stored as character variables. Looking closer at these, it can be observed that most values are in the format YYYY-MM-DD with many values stored only as four-digit years. Since all of these records begin with the year, the year can be extracted for all records using only the first four digits. Therefore, this variable will be cleaned by isolating the four-digit year then converting it to a numeric variable.

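# Keep only the leading four-digit year of each release date
songs$track_album_release_date <- substr(songs$track_album_release_date, 1, 4)
# Convert the year from character to numeric
songs$track_album_release_date <- as.numeric(songs$track_album_release_date)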
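
The same extraction can be tried on a handful of strings to confirm that it handles mixed formats. This is a small sketch with hypothetical date values; the year-month form ("1998-03") is an assumption added for illustration and is not mentioned in the description above.

```r
# Hypothetical release-date strings in mixed formats (not real records).
dates <- c("2019-06-14", "2012", "1998-03", "2020-01-10")

# The first four characters are the year, whatever follows them.
years <- as.numeric(substr(dates, 1, 4))

# Guard: a value that failed to parse would appear as NA here.
stopifnot(!anyNA(years))

years  # 2019 2012 1998 2020
```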

This leaves a cleaned data set, illustrated by a condensed view of the first ten rows:

as_tibble(songs)
## # A tibble: 28,352 x 16
##    track_name     track_artist  track_popularity track_album_rel~ playlist_genre
##    <chr>          <chr>                    <int>            <dbl> <chr>         
##  1 I Don't Care ~ Ed Sheeran                  66             2019 pop           
##  2 Memories - Di~ Maroon 5                    67             2019 pop           
##  3 All the Time ~ Zara Larsson                70             2019 pop           
##  4 Call You Mine~ The Chainsmo~               60             2019 pop           
##  5 Someone You L~ Lewis Capaldi               69             2019 pop           
##  6 Beautiful Peo~ Ed Sheeran                  67             2019 pop           
##  7 Never Really ~ Katy Perry                  62             2019 pop           
##  8 Post Malone (~ Sam Feldt                   69             2019 pop           
##  9 Tough Love - ~ Avicii                      68             2019 pop           
## 10 If I Can't Ha~ Shawn Mendes                67             2019 pop           
## # ... with 28,342 more rows, and 11 more variables: danceability <dbl>,
## #   energy <dbl>, loudness <dbl>, mode <int>, speechiness <dbl>,
## #   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
## #   tempo <dbl>, duration_ms <int>

The final set contains 16 variables as follows:

  1. track_name: a listing of each song title
  2. track_artist: the artist or band who recorded the song
  3. track_popularity: ranges from 0 to 100, with higher scores representing higher popularity
  4. track_album_release_date: converted to year of release (skewed toward newer releases)
  5. playlist_genre: the primary genre of the track, with the unique categories of:
## [1] "pop"   "rap"   "rock"  "latin" "r&b"   "edm"
  6. danceability: ranges from 0.0 to 1.0 with 1.0 being most danceable, based on a combination of musical elements
  7. energy: ranges from 0.0 to 1.0 with 1.0 being highest energy, based perceptually on intensity, loudness, activity, etc.
  8. loudness: ranges from -60 dB to 0 dB, averaged for the overall loudness of the entire track
  9. mode: either 0 or 1 for minor melodic scale or major melodic scale, respectively
  10. speechiness: ranges from 0 to 1, with higher values representing a higher volume of spoken words, such as rap music (data trends melodic)
  11. acousticness: ranges from 0 to 1, with higher values representing higher acousticness of a track (data trends less acoustic)
  12. instrumentalness: ranges from 0 to 1, with higher values representing fewer vocal elements (data trends less instrumental)
  13. liveness: ranges from 0 to 1, with higher values indicating a greater probability that the track was performed live (somewhat skewed toward studio recording)
  14. valence: ranges from 0 to 1, with higher values indicating a more positive or cheerful tone
  15. tempo: the overall estimated tempo of the track in beats per minute (BPM), with higher values indicating higher speed of tempo
  16. duration_ms: duration of the track in milliseconds
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4000  187741  216933  226575  254975  517810
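
Since milliseconds are hard to read at a glance, the duration summary can be converted to minutes. The small sketch below reuses the six figures printed above; the vector names are my own and are not part of the data set.

```r
# The six summary values for duration_ms, copied from the output above.
duration_ms <- c(Min = 4000, Q1 = 187741, Median = 216933,
                 Mean = 226575, Q3 = 254975, Max = 517810)

# 1 minute = 60,000 milliseconds.
duration_min <- duration_ms / 60000

round(duration_min, 2)
# Min 0.07, Q1 3.13, Median 3.62, Mean 3.78, Q3 4.25, Max 8.63
```

Read this way, the typical track runs a little over three and a half minutes, while the shortest entry lasts only four seconds.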

Proposed Exploratory Data Analysis

This analysis will consist of two parts. In part one, I will analyze popularity using the numeric variables that describe the attributes of the song. I will search for the model of best fit and propose a regression equation for predicting the popularity of a track. In part two, I will evaluate the qualities of the most popular tracks compared to those of the least popular tracks. Overall, this analysis will provide a multi-faceted view of the qualities that differentiate the most popular music from the least popular music. Visualizations will include a plot of the regression equation with descriptions of the coefficients, and charted comparisons of high and low categories as well as other numeric variables of interest. Since I am still a student of R, this will require learning the available packages and their capabilities as well as methods that will enhance the overall appearance of visualization outputs.
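
As a rough sketch of what the two parts might look like in code, the example below fits a linear model with `lm()` and then compares group means for high- and low-popularity tracks. It runs on simulated data, not the Spotify set: the choice of predictors, the made-up relationship between them and popularity, and the quartile cutoffs for "most" and "least" popular are all assumptions for illustration, and the numbers it produces say nothing about real songs.

```r
set.seed(42)  # reproducible simulated data; NOT the Spotify set

n <- 200
sim <- data.frame(
  danceability = runif(n),
  energy       = runif(n),
  valence      = runif(n)
)
# Simulated popularity with an arbitrary, made-up relationship plus noise.
sim$track_popularity <- 40 + 20 * sim$danceability - 10 * sim$energy +
  rnorm(n, sd = 5)

# Part one: multiple linear regression on numeric attributes.
fit <- lm(track_popularity ~ danceability + energy + valence, data = sim)
coef(fit)               # estimated intercept and slopes
summary(fit)$r.squared  # share of variance explained by the model

# Part two: compare the most and least popular tracks, taken here
# as the top and bottom quartiles (an assumed cutoff).
cuts <- quantile(sim$track_popularity, c(0.25, 0.75))
sim$group <- ifelse(sim$track_popularity >= cuts[2], "high",
             ifelse(sim$track_popularity <= cuts[1], "low", NA_character_))

# Rows in the middle half have group NA and are dropped by aggregate().
group_means <- aggregate(cbind(danceability, energy, valence) ~ group,
                         data = sim, FUN = mean)
group_means
```

On the real data, the same pattern would apply with `songs` in place of `sim`, with the predictors chosen during the model search rather than fixed in advance.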