Music Exploratory Analysis (Using Spotify data via the spotifyr package)

Introduction

1.1 Problem Statement: The objective of this project is to explore the Spotify genre data and determine which factors (if any) are associated with a song’s popularity, which genre is the most popular, and which genres are produced the most. We have 23 variables to help us answer these questions. Answering them can indicate what a song needs in order to become popular in today’s world, which could aid the music recording industry and hopeful recording artists.

1.2 Plan: The first step will be to clean and organize the data, spotify_songs.csv. Next, we will start simple and find out which genres are the most produced and which genre is the most popular. After that, we will compare pairs of variables to look for correlation. This step will be tricky because correlation does not imply causation, so we will interpret any relationship cautiously and avoid comparing variables that have no plausible connection to each other.

1.3 Approach: Our current approach uses basic statistical techniques, such as the mean, variance, correlation, and hypothesis testing, to partially address our problem. We know that this approach will not fully answer all of our questions, but it is a good start as we continue to learn more. We will use the tidyverse packages to help clean the data. We may use ggplot2 for the correlation graphs, but we have never used ggplot2 before, so we might stick with base R graphics for that.

1.4 Consumer Advantage: Our analysis can help a consumer, such as a recording artist or a music producer, decide how to approach creating a song and what factors to consider in order to maximize the popularity of their song. It can also allow the consumer to decide what genre to get involved with.

Packages Required

  • tidyr - clean and organize the data
  • ggplot2 - visualize the data
  • dplyr - helps with common data manipulation tasks
  • readr - reads the CSV file and reports the name and type of each column
  • tibble - stores data as a tibble, which makes it easier to handle and manipulate data

*All of the above packages are included in tidyverse, so it’s easiest to load that.

  • knitr - helps create easy-to-read tables
library(tidyverse)
library(knitr)

Data Preparation

Spotify Songs

The data comes from Spotify via the spotifyr package. Kaylin Pavlik gathered the original data to compare six genres of music and summarize which variables stand out in each genre.

The original dataset contains 23 variables and 32,833 songs spanning 6 genres (EDM, Latin, Pop, R&B, Rap, and Rock). One peculiarity we noticed in the original data set is that many observations in track_artist, track_name, playlist_name, and track_album_name contain unusual characters. For example, a couple of the observations have Beyoncé as the track_artist. This looks like a character-encoding problem from when the data was imported or displayed (accented characters stored as UTF-8 but decoded with a different encoding), so these observations most likely refer to Beyoncé. Our challenge will be to fix these unusual characters.
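One possible way to repair such characters is sketched below. It assumes the text was stored as UTF-8 and then decoded as Latin-1, which is what a value like Beyoncé suggests; we have not confirmed that this is what happened to this file. The helper name fix_mojibake is hypothetical, and the function leaves a value unchanged whenever the repair does not produce valid UTF-8.

```r
# Hypothetical helper (not part of the original analysis): undo text that was
# stored as UTF-8 but decoded as Latin-1.
fix_mojibake <- function(x) {
  # Step 1: map each character back to the single byte it was misread from.
  bytes <- iconv(x, from = "UTF-8", to = "latin1")
  # Step 2: declare that those bytes are really UTF-8.
  Encoding(bytes) <- "UTF-8"
  # Keep the original value when conversion failed or the result is not valid UTF-8.
  ok <- !is.na(bytes) & validUTF8(bytes)
  bytes[!ok] <- x[!ok]
  bytes
}

# Toy input: one garbled artist name and one clean name.
garbled <- c("Beyonc\u00c3\u00a9", "Ed Sheeran")
fixed <- fix_mojibake(garbled)
fixed
```

If the assumption holds, the same helper could be applied to track_artist, track_name, playlist_name, and track_album_name. Characters that only exist in Windows-1252 (and not in Latin-1) would be left as they are by this sketch.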

Data Import

Let’s take a look at the data:

spotify <- read_csv("spotify_songs.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   track_id = col_character(),
##   track_name = col_character(),
##   track_artist = col_character(),
##   track_album_id = col_character(),
##   track_album_name = col_character(),
##   track_album_release_date = col_character(),
##   playlist_name = col_character(),
##   playlist_id = col_character(),
##   playlist_genre = col_character(),
##   playlist_subgenre = col_character()
## )
## See spec(...) for full column specifications.
class(spotify)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
dim(spotify)
## [1] 32833    23
glimpse(spotify)
## Rows: 32,833
## Columns: 23
## $ track_id                 <chr> "6f807x0ima9a1j3VPbc7VN", "0r7CVbZTWZgbTCYdf…
## $ track_name               <chr> "I Don't Care (with Justin Bieber) - Loud Lu…
## $ track_artist             <chr> "Ed Sheeran", "Maroon 5", "Zara Larsson", "T…
## $ track_popularity         <dbl> 66, 67, 70, 60, 69, 67, 62, 69, 68, 67, 58, …
## $ track_album_id           <chr> "2oCs0DGTsRO98Gh5ZSl2Cx", "63rPSO264uRjW1X5E…
## $ track_album_name         <chr> "I Don't Care (with Justin Bieber) [Loud Lux…
## $ track_album_release_date <chr> "2019-06-14", "2019-12-13", "2019-07-05", "2…
## $ playlist_name            <chr> "Pop Remix", "Pop Remix", "Pop Remix", "Pop …
## $ playlist_id              <chr> "37i9dQZF1DXcZDD7cfEKhW", "37i9dQZF1DXcZDD7c…
## $ playlist_genre           <chr> "pop", "pop", "pop", "pop", "pop", "pop", "p…
## $ playlist_subgenre        <chr> "dance pop", "dance pop", "dance pop", "danc…
## $ danceability             <dbl> 0.748, 0.726, 0.675, 0.718, 0.650, 0.675, 0.…
## $ energy                   <dbl> 0.916, 0.815, 0.931, 0.930, 0.833, 0.919, 0.…
## $ key                      <dbl> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, 8, 1, 5, 5…
## $ loudness                 <dbl> -2.634, -4.969, -3.432, -3.778, -4.672, -5.3…
## $ mode                     <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0,…
## $ speechiness              <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.0359, 0.12…
## $ acousticness             <dbl> 0.10200, 0.07240, 0.07940, 0.02870, 0.08030,…
## $ instrumentalness         <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43e-06, 0.00…
## $ liveness                 <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.0833, 0.14…
## $ valence                  <dbl> 0.518, 0.693, 0.613, 0.277, 0.725, 0.585, 0.…
## $ tempo                    <dbl> 122.036, 99.972, 124.008, 121.956, 123.976, …
## $ duration_ms              <dbl> 194754, 162600, 176616, 169093, 189052, 1630…

Looking at the initial data, we can see that the variable names don’t need adjusting, as they are already in an easy-to-read and organized format. The variables are also the correct type except for track_album_release_date, which is currently a character. We don’t expect to use this variable, though, so we will leave it as is for now and change it later if we need to.
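If the release date is needed later, the conversion could look roughly like the sketch below. The vector release_chr is a toy stand-in for track_album_release_date, and the year-only value "2012" is a hypothetical entry included to show what happens to values that do not match the expected format.

```r
# Toy stand-in for spotify$track_album_release_date (values are illustrative).
release_chr <- c("2019-06-14", "2019-12-13", "2019-07-05", "2012")

# Values that do not match year-month-day become NA instead of raising an error.
release_date <- as.Date(release_chr, format = "%Y-%m-%d")

class(release_date)       # check the new type
sum(is.na(release_date))  # number of values that failed to parse
format(release_date, "%Y") # release year, e.g. for grouping by year
```

Counting the NA values after the conversion is a cheap way to see whether any dates in the real column use a different format.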

Data Cleaning

The first part of data cleaning is removing outliers, and song duration has a few. There is one song that is 4 seconds long and several songs around 30 seconds long. We are going to remove the 4-second song but keep the 30-second songs: many of these are interludes, and they could hint at whether an entire album is popular, since listeners of a popular album would play it all the way through, interludes included. The longest song is 8 minutes and 37 seconds, which we will keep.

The second part of data cleaning is removing abnormalities. We are going to remove the few songs whose loudness is above 0 dB, as this is considered abnormal.

The third part of data cleaning is assessing missing values. There are 15 missing values spread across 5 rows, in the track_name, track_artist, and track_album_name columns. Since we cannot tell which songs these are and the popularity is 0 for all 5 rows, we decided to delete them.

spotify <- spotify[!(spotify$duration_ms < 5000), ] # removing the 4 second song
spotify <- spotify[!(spotify$loudness > 0), ] # removing songs over 0 dB (considered too loud)
colSums(is.na(spotify))
##                 track_id               track_name             track_artist 
##                        0                        5                        5 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        5 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0
spotify <- spotify[!is.na(spotify$track_name), ] # remove NA values
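For comparison, the same three cleaning steps can be written as a dplyr pipeline. This is only a sketch on an invented four-row table (toy), not the code used for the results in this report. One difference to keep in mind is that filter() also drops rows where the condition evaluates to NA.

```r
library(dplyr)

# Toy stand-in for `spotify`; all values are invented for illustration.
toy <- tibble(
  track_id    = c("a", "b", "c", "d"),
  track_name  = c("Song A", "Song B", NA, "Song D"),
  duration_ms = c(194754, 4000, 180000, 200000),
  loudness    = c(-2.6, -5.0, -4.1, 1.3)
)

cleaned <- toy %>%
  filter(duration_ms >= 5000) %>%  # drop the very short track
  filter(loudness <= 0) %>%        # drop tracks above 0 dB
  filter(!is.na(track_name))       # drop rows with a missing track name

nrow(toy) - nrow(cleaned)          # number of rows removed
```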

There are also 4,477 duplicate songs (identified by the track_id variable) because the same song can appear on multiple playlists. After examining the data, we found that the playlist was the only difference between these duplicates (all other values were the same), so we determined that it was okay to remove them.

However, before removing the duplicates, we counted how many playlists each track_id appears on and how often each count occurs, as this can give us a hint at what makes a song popular.

This gives us three datasets: two dedicated to the duplicate songs and their frequency (spotify_dup_count and spotify_freq_count), and one with a single observation of each song (spotify). We wanted the spotify dataset to consist only of single observations so that the analysis is not skewed by multiple entries of the same song.

spotify_dup_count <- spotify %>% count(track_id, sort=TRUE)
spotify_dup_count #creating a tibble to store track_id's and number of duplicates
## # A tibble: 28,345 x 2
##    track_id                   n
##    <chr>                  <int>
##  1 7BKLCZ1jbUBVqRi2FVlTVw    10
##  2 14sOS5L36385FJ3OL8hew4     9
##  3 3eekarcy7kvN4yt5ZFzltW     9
##  4 0nbXyq5TXYPCO7pr3N8S4I     8
##  5 0qaWEvPkts34WF68r8Dzx9     8
##  6 0rIAC4PXANcKmitJfoqmVm     8
##  7 0sf12qNH5qcw8qpgymFOqD     8
##  8 2b8fOow8UzyDFAE27YhOZM     8
##  9 2Fxmhks0bxGSBdJ92vM42m     8
## 10 2tnVG71enUj33Ic2nFN6kZ     8
## # … with 28,335 more rows
spotify_freq_count <- spotify_dup_count %>% count(n,sort=TRUE)
spotify_freq_count #creating a tibble to store number of duplicates and their frequencies
## # A tibble: 10 x 2
##        n    nn
##    <int> <int>
##  1     1 25180
##  2     2  2383
##  3     3   510
##  4     4   142
##  5     5    60
##  6     6    35
##  7     7    17
##  8     8    15
##  9     9     2
## 10    10     1
spotify <- spotify[!duplicated(spotify$track_id), ] # removing duplicated songs
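The claim that duplicates differ only in their playlist columns can be checked programmatically. The sketch below uses an invented table (toy) in which track "c" deliberately has two different popularity values; on the real data, an empty conflicts table would support removing the duplicates. It also shows distinct() as a dplyr alternative to duplicated().

```r
library(dplyr)

# Toy data: track "a" is a clean duplicate, track "c" differs in popularity.
toy <- tibble(
  track_id         = c("a", "a", "b", "c", "c"),
  track_popularity = c(66, 66, 70, 55, 58),
  playlist_name    = c("Pop Remix", "Dance Pop", "Pop Remix", "Rock Mix", "Indie Mix")
)

# Drop the playlist columns, then count the distinct versions of each track.
conflicts <- toy %>%
  select(-starts_with("playlist_")) %>%
  distinct() %>%
  count(track_id, name = "n_versions") %>%
  filter(n_versions > 1)

conflicts  # tracks whose non-playlist values disagree

# Keep the first row for each track_id, like !duplicated(track_id).
deduped <- toy %>% distinct(track_id, .keep_all = TRUE)
```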

After cleaning the data, we are left with the following observations and columns:

dim(spotify)
## [1] 28345    23

Data Preview

glimpse(spotify)
## Rows: 28,345
## Columns: 23
## $ track_id                 <chr> "6f807x0ima9a1j3VPbc7VN", "0r7CVbZTWZgbTCYdf…
## $ track_name               <chr> "I Don't Care (with Justin Bieber) - Loud Lu…
## $ track_artist             <chr> "Ed Sheeran", "Maroon 5", "Zara Larsson", "T…
## $ track_popularity         <dbl> 66, 67, 70, 60, 69, 67, 62, 69, 68, 67, 58, …
## $ track_album_id           <chr> "2oCs0DGTsRO98Gh5ZSl2Cx", "63rPSO264uRjW1X5E…
## $ track_album_name         <chr> "I Don't Care (with Justin Bieber) [Loud Lux…
## $ track_album_release_date <chr> "2019-06-14", "2019-12-13", "2019-07-05", "2…
## $ playlist_name            <chr> "Pop Remix", "Pop Remix", "Pop Remix", "Pop …
## $ playlist_id              <chr> "37i9dQZF1DXcZDD7cfEKhW", "37i9dQZF1DXcZDD7c…
## $ playlist_genre           <chr> "pop", "pop", "pop", "pop", "pop", "pop", "p…
## $ playlist_subgenre        <chr> "dance pop", "dance pop", "dance pop", "danc…
## $ danceability             <dbl> 0.748, 0.726, 0.675, 0.718, 0.650, 0.675, 0.…
## $ energy                   <dbl> 0.916, 0.815, 0.931, 0.930, 0.833, 0.919, 0.…
## $ key                      <dbl> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, 8, 1, 5, 5…
## $ loudness                 <dbl> -2.634, -4.969, -3.432, -3.778, -4.672, -5.3…
## $ mode                     <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0,…
## $ speechiness              <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.0359, 0.12…
## $ acousticness             <dbl> 0.10200, 0.07240, 0.07940, 0.02870, 0.08030,…
## $ instrumentalness         <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43e-06, 0.00…
## $ liveness                 <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.0833, 0.14…
## $ valence                  <dbl> 0.518, 0.693, 0.613, 0.277, 0.725, 0.585, 0.…
## $ tempo                    <dbl> 122.036, 99.972, 124.008, 121.956, 123.976, …
## $ duration_ms              <dbl> 194754, 162600, 176616, 169093, 189052, 1630…

Variable Summary

Variable.type <- lapply(spotify, class)
Variable.desc <- c("Unique ID assigned to each song", "Song name", "Song artist", 
                   "Song popularity (0-100) where higher is better", "Unique ID assigned to each album",
                   "Album name that song is on", "Date when album was released", 
                   "Name of playlist that has song on it",
                   "Unique ID of playlist", "Genre of the playlist", "Subgenre of the playlist", 
                   "How suitable a track is for dancing (0 is least danceable and 1 is most danceable)",
                   "Energy represents the measure of intensity and activity (range from 0 to 1)",
                   "Overall key of the track (0 = C, 1 = C#, 2 = D, etc. -1 = no key detected)",
                   "Loudness of a track in decibels (dB)", "Modality of a track (major = 1, minor = 0)",
                   "Presence of spoken words in a track (range from 0 to 1)",
                   "Confidence measure from 0 to 1 of whether the track is acoustic", 
                   "Predicts whether a track contains vocals or not (1 indicates no vocals, 0 is high vocals)",
                   "Detects the presence of an audience in the recording (values above 0.8 are most likely live tracks)",
                   "Describes positivity of a song (1 is cheerful, 0 is sad or angry)", 
                   "Tempo of a track in beats per minute (BPM)", "Duration of song in milliseconds")
Variable.name1 <- colnames(spotify)
data.desc <- as_tibble(cbind(Variable.name1, Variable.type, Variable.desc))
colnames(data.desc) <- c("Variable Name", "Data Type", "Variable Description")
kable(data.desc)
|Variable Name            |Data Type |Variable Description |
|:------------------------|:---------|:--------------------|
|track_id                 |character |Unique ID assigned to each song |
|track_name               |character |Song name |
|track_artist             |character |Song artist |
|track_popularity         |numeric   |Song popularity (0-100) where higher is better |
|track_album_id           |character |Unique ID assigned to each album |
|track_album_name         |character |Album name that song is on |
|track_album_release_date |character |Date when album was released |
|playlist_name            |character |Name of playlist that has song on it |
|playlist_id              |character |Unique ID of playlist |
|playlist_genre           |character |Genre of the playlist |
|playlist_subgenre        |character |Subgenre of the playlist |
|danceability             |numeric   |How suitable a track is for dancing (0 is least danceable and 1 is most danceable) |
|energy                   |numeric   |Energy represents the measure of intensity and activity (range from 0 to 1) |
|key                      |numeric   |Overall key of the track (0 = C, 1 = C#, 2 = D, etc. -1 = no key detected) |
|loudness                 |numeric   |Loudness of a track in decibels (dB) |
|mode                     |numeric   |Modality of a track (major = 1, minor = 0) |
|speechiness              |numeric   |Presence of spoken words in a track (range from 0 to 1) |
|acousticness             |numeric   |Confidence measure from 0 to 1 of whether the track is acoustic |
|instrumentalness         |numeric   |Predicts whether a track contains vocals or not (1 indicates no vocals, 0 is high vocals) |
|liveness                 |numeric   |Detects the presence of an audience in the recording (values above 0.8 are most likely live tracks) |
|valence                  |numeric   |Describes positivity of a song (1 is cheerful, 0 is sad or angry) |
|tempo                    |numeric   |Tempo of a track in beats per minute (BPM) |
|duration_ms              |numeric   |Duration of song in milliseconds |

Proposed Exploratory Data Analysis

Plan of Attack

We plan to separate the data by genre and then compare different variables with the popularity variable to see if there is any correlation. We want to see what could potentially make a song from a specific genre popular. We also want to see if there is any correlation between the popularity of a song and the number of different playlists that the song appears on.

This will be done by manipulating the original data set so that the code needed for each output stays simple. The manipulation techniques include creating new variables (as seen with spotify_dup_count and spotify_freq_count) and subsetting the data in different ways to pull out the desired variables and outcomes. The most difficult part of this will be separating the data by genre. Once we have that, correlation tests should be simple enough. After that, we can make simple correlation graphs, and hopefully more advanced graphs eventually.
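A minimal sketch of both comparisons is given below, assuming dplyr is used for the grouping. All data here is randomly generated or invented, so the numbers mean nothing; the names toy, by_genre, playlist_toy, and per_track are placeholders. On the real data, spotify and spotify_dup_count would take the place of the toy tables.

```r
library(dplyr)

set.seed(1)  # toy data only, so the example is reproducible
toy <- tibble(
  playlist_genre   = rep(c("pop", "rock"), each = 50),
  danceability     = runif(100),
  track_popularity = round(100 * runif(100))
)

# Correlation between danceability and popularity, computed within each genre.
by_genre <- toy %>%
  group_by(playlist_genre) %>%
  summarise(
    n_songs = n(),
    r       = cor(danceability, track_popularity),
    p_value = cor.test(danceability, track_popularity)$p.value
  )
by_genre

# Popularity versus the number of playlists a track appears on.
playlist_toy <- tibble(
  track_id         = c("a", "a", "a", "b", "c", "c", "d"),
  track_popularity = c(80, 80, 80, 40, 60, 60, 35)
)
playlist_counts <- playlist_toy %>% count(track_id, name = "n_playlists")
per_track <- playlist_toy %>%
  distinct(track_id, .keep_all = TRUE) %>%
  left_join(playlist_counts, by = "track_id")
cor(per_track$n_playlists, per_track$track_popularity)
```

Using group_by() avoids creating a separate data set for every genre, which may make the step described as the most difficult considerably shorter.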

Plots and Tables

We will mostly be using correlation graphs (scatter plots) for this data and potentially a few density plots. We want to use correlation graphs to see if there is any correlation between popularity and the other variables. Density plots may help us spot these relationships faster, as they show how the distribution of a variable differs between genres.
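The two plot types could be built with ggplot2 roughly as follows. This is a sketch on randomly generated data (toy), and danceability is only an example choice for the x-axis; any of the numeric variables could be used instead.

```r
library(ggplot2)

set.seed(2)  # toy data only
toy <- data.frame(
  playlist_genre   = rep(c("pop", "rock", "rap"), each = 40),
  danceability     = runif(120),
  track_popularity = round(100 * runif(120))
)

# Scatter plot with a fitted straight line, one panel per genre.
scatter_plot <- ggplot(toy, aes(x = danceability, y = track_popularity)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ playlist_genre) +
  labs(x = "Danceability", y = "Popularity (0-100)")

# Overlapping density curves of popularity, one per genre.
density_plot <- ggplot(toy, aes(x = track_popularity, fill = playlist_genre)) +
  geom_density(alpha = 0.4) +
  labs(x = "Popularity (0-100)", fill = "Genre")

# Call print(scatter_plot) or print(density_plot) to draw the plots.
```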

Learning Objectives

Currently, we are not fully confident about how to separate the data into genres, but we have a good idea of how to do it. We also have never used ggplot2. It is not required to complete this project, but it would greatly help us share our findings visually. There are also a lot of strange characters in the song names, and it would be nice to know how to fix those, but doing so is not necessary in order to get our answers.

Machine Learning Techniques

We may use linear regression to see which variables are associated with popularity, although that may be overkill because we are mostly looking for correlation rather than building a predictive model. We aren’t necessarily trying to predict values here, just to show which variables may contribute to a song becoming popular. A cluster analysis may also be useful in answering our questions, because we can arrange the songs into different genre clusters, which may help us better visualize overall genre popularity.
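Both techniques are available in base R. The sketch below runs them on simulated data in which popularity is made to depend on danceability, purely so that the regression has something to find; the choice of three predictors and three clusters is an assumption for illustration, not a recommendation for the real data.

```r
set.seed(3)  # simulated data only
n <- 200
toy <- data.frame(
  danceability = runif(n),
  energy       = runif(n),
  valence      = runif(n)
)
# Invented relationship: popularity rises with danceability, plus noise.
toy$track_popularity <- 40 + 20 * toy$danceability + rnorm(n, sd = 10)

# Linear regression of popularity on the audio features.
fit <- lm(track_popularity ~ danceability + energy + valence, data = toy)
summary(fit)$coefficients

# k-means clustering on standardized audio features.
features <- scale(toy[, c("danceability", "energy", "valence")])
clusters <- kmeans(features, centers = 3, nstart = 10)
toy$cluster <- clusters$cluster

# Mean popularity within each cluster.
aggregate(track_popularity ~ cluster, data = toy, FUN = mean)
```

On the real data, the cluster labels could be cross-tabulated against playlist_genre with table() to see how closely the clusters follow the genres.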