Spotify Wrangling Midterm

Introduction

In this analysis, we explore which factors influence the popularity of a song, using a Spotify data set from the TidyTuesday series. Our study may interest musicians or producers who want to understand how they can make music that is more popular with their target audience on Spotify. The factors we identify may also help artists reach a larger audience. Even if we find no relationship between these metrics and popularity, that is itself an interesting conclusion that can inform the decisions of those who make and listen to music. As a consumer, it can be hard to pinpoint why a song is or is not enjoyable. Our findings may help Spotify listeners identify songs similar to those they already enjoy, improving their listening experience.

Since this data set has many variables, we will use our cleaning and analysis to narrow them down to a handful that we can explore further. Addressing our questions therefore involves choosing a few key variables and not getting lost in the data. After cleaning, we plan to explore how the following variables relate to popularity: valence, energy, mode, and loudness, each of which is explained in our data dictionary. We may also explore the relationship between release date and popularity, which will require care in data type conversion.

Packages Required

Tidyverse is a collection of packages that is designed to simplify data analysis. A number of the functions contained in library(tidyverse) make it easier to sort through data, look at specific variables or columns, rename or create variables, group data differently, and much more.

library(tidyverse)

Data Preparation

Data Source

The data used in this project was obtained from this page ¹ via the following code.

spotify_songs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   track_id = col_character(),
##   track_name = col_character(),
##   track_artist = col_character(),
##   track_album_id = col_character(),
##   track_album_name = col_character(),
##   track_album_release_date = col_character(),
##   playlist_name = col_character(),
##   playlist_id = col_character(),
##   playlist_genre = col_character(),
##   playlist_subgenre = col_character()
## )

## See spec(...) for full column specifications.

Data Background

This 2020 Spotify data comes from the spotifyr package, which is an R wrapper that was created by Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff to make it easier to access your own Spotify data or general data about songs from Spotify’s API.

The data set explored here was gathered by Kaylin Pavlik using audio features of the Spotify data in pursuit of exploration and classification of a collection of songs from 6 main genres (EDM, Latin, Pop, R&B, Rap, and Rock).

Data Dictionary

When initially downloaded, this data contained the following 23 variables:

Variable	Class	Description
track_id	character	Unique ID of a song
track_name	character	Song name
track_artist	character	Song artist
track_popularity	double	Song popularity on a scale of 0 to 100 where a higher number means more popular
track_album_id	character	Unique ID of an album
track_album_name	character	Album name that the song belongs to
track_album_release_date	character	Date the album was released
playlist_name	character	Name of the playlist
playlist_id	character	Unique playlist ID
playlist_genre	character	Genre of a playlist
playlist_subgenre	character	Subgenre of a playlist
danceability	double	Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy	double	Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
key	double	The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
loudness	double	The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
mode	double	Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness	double	Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
acousticness	double	A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
instrumentalness	double	Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness	double	Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence	double	A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo	double	The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
duration_ms	double	Duration of song in milliseconds

Preliminary Exploration

First, we look at the first and last few rows in order to see what our actual data looks like.

head(spotify_songs, 5)

tail(spotify_songs, 10)

Next, we investigate the structure of our data set and its classes. We learn that there are 32,833 total entries and that the Spotify data inherits the attributes of multiple classes. Knowing which classes this data belongs to tells us which methods we can use in our data analysis.

class(spotify_songs)

## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

str(spotify_songs)

## tibble [32,833 × 23] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ track_id                : chr [1:32833] "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
##  $ track_name              : chr [1:32833] "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr [1:32833] "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : num [1:32833] 66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : chr [1:32833] "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
##  $ track_album_name        : chr [1:32833] "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr [1:32833] "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
##  $ playlist_name           : chr [1:32833] "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_id             : chr [1:32833] "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
##  $ playlist_genre          : chr [1:32833] "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr [1:32833] "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num [1:32833] 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num [1:32833] 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : num [1:32833] 6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num [1:32833] -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : num [1:32833] 1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num [1:32833] 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num [1:32833] 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num [1:32833] 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num [1:32833] 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num [1:32833] 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num [1:32833] 122 100 124 122 124 ...
##  $ duration_ms             : num [1:32833] 194754 162600 176616 169093 189052 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   track_id = col_character(),
##   ..   track_name = col_character(),
##   ..   track_artist = col_character(),
##   ..   track_popularity = col_double(),
##   ..   track_album_id = col_character(),
##   ..   track_album_name = col_character(),
##   ..   track_album_release_date = col_character(),
##   ..   playlist_name = col_character(),
##   ..   playlist_id = col_character(),
##   ..   playlist_genre = col_character(),
##   ..   playlist_subgenre = col_character(),
##   ..   danceability = col_double(),
##   ..   energy = col_double(),
##   ..   key = col_double(),
##   ..   loudness = col_double(),
##   ..   mode = col_double(),
##   ..   speechiness = col_double(),
##   ..   acousticness = col_double(),
##   ..   instrumentalness = col_double(),
##   ..   liveness = col_double(),
##   ..   valence = col_double(),
##   ..   tempo = col_double(),
##   ..   duration_ms = col_double()
##   .. )

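Looking at a summary of each variable allows us to identify any initially abnormal values. In the cleaning steps, the character variables will be changed to factors in order to categorize the data into levels so we can learn more. From this initial summary, none of the numeric variables shows values far outside its documented range, although a few extremes (a minimum tempo of 0, a minimum duration of 4,000 ms, and a maximum loudness above 0 dB) may deserve a closer look.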

summary(spotify_songs)

##    track_id          track_name        track_artist       track_popularity
##  Length:32833       Length:32833       Length:32833       Min.   :  0.00  
##  Class :character   Class :character   Class :character   1st Qu.: 24.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 45.00  
##                                                           Mean   : 42.48  
##                                                           3rd Qu.: 62.00  
##                                                           Max.   :100.00  
##  track_album_id     track_album_name   track_album_release_date
##  Length:32833       Length:32833       Length:32833            
##  Class :character   Class :character   Class :character        
##  Mode  :character   Mode  :character   Mode  :character        
##                                                                
##                                                                
##                                                                
##  playlist_name      playlist_id        playlist_genre     playlist_subgenre 
##  Length:32833       Length:32833       Length:32833       Length:32833      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   danceability        energy              key            loudness      
##  Min.   :0.0000   Min.   :0.000175   Min.   : 0.000   Min.   :-46.448  
##  1st Qu.:0.5630   1st Qu.:0.581000   1st Qu.: 2.000   1st Qu.: -8.171  
##  Median :0.6720   Median :0.721000   Median : 6.000   Median : -6.166  
##  Mean   :0.6548   Mean   :0.698619   Mean   : 5.374   Mean   : -6.720  
##  3rd Qu.:0.7610   3rd Qu.:0.840000   3rd Qu.: 9.000   3rd Qu.: -4.645  
##  Max.   :0.9830   Max.   :1.000000   Max.   :11.000   Max.   :  1.275  
##       mode         speechiness      acousticness    instrumentalness   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000000  
##  1st Qu.:0.0000   1st Qu.:0.0410   1st Qu.:0.0151   1st Qu.:0.0000000  
##  Median :1.0000   Median :0.0625   Median :0.0804   Median :0.0000161  
##  Mean   :0.5657   Mean   :0.1071   Mean   :0.1753   Mean   :0.0847472  
##  3rd Qu.:1.0000   3rd Qu.:0.1320   3rd Qu.:0.2550   3rd Qu.:0.0048300  
##  Max.   :1.0000   Max.   :0.9180   Max.   :0.9940   Max.   :0.9940000  
##     liveness         valence           tempo         duration_ms    
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Min.   :  4000  
##  1st Qu.:0.0927   1st Qu.:0.3310   1st Qu.: 99.96   1st Qu.:187819  
##  Median :0.1270   Median :0.5120   Median :121.98   Median :216000  
##  Mean   :0.1902   Mean   :0.5106   Mean   :120.88   Mean   :225800  
##  3rd Qu.:0.2480   3rd Qu.:0.6930   3rd Qu.:133.92   3rd Qu.:253585  
##  Max.   :0.9960   Max.   :0.9910   Max.   :239.44   Max.   :517810

The following gives us a concise look at as much data as possible by printing the columns of the data frame down the page instead of across.

glimpse(spotify_songs)

## Rows: 32,833
## Columns: 23
## $ track_id                 <chr> "6f807x0ima9a1j3VPbc7VN", "0r7CVbZTWZgbTCYdf…
## $ track_name               <chr> "I Don't Care (with Justin Bieber) - Loud Lu…
## $ track_artist             <chr> "Ed Sheeran", "Maroon 5", "Zara Larsson", "T…
## $ track_popularity         <dbl> 66, 67, 70, 60, 69, 67, 62, 69, 68, 67, 58, …
## $ track_album_id           <chr> "2oCs0DGTsRO98Gh5ZSl2Cx", "63rPSO264uRjW1X5E…
## $ track_album_name         <chr> "I Don't Care (with Justin Bieber) [Loud Lux…
## $ track_album_release_date <chr> "2019-06-14", "2019-12-13", "2019-07-05", "2…
## $ playlist_name            <chr> "Pop Remix", "Pop Remix", "Pop Remix", "Pop …
## $ playlist_id              <chr> "37i9dQZF1DXcZDD7cfEKhW", "37i9dQZF1DXcZDD7c…
## $ playlist_genre           <chr> "pop", "pop", "pop", "pop", "pop", "pop", "p…
## $ playlist_subgenre        <chr> "dance pop", "dance pop", "dance pop", "danc…
## $ danceability             <dbl> 0.748, 0.726, 0.675, 0.718, 0.650, 0.675, 0.…
## $ energy                   <dbl> 0.916, 0.815, 0.931, 0.930, 0.833, 0.919, 0.…
## $ key                      <dbl> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, 8, 1, 5, 5…
## $ loudness                 <dbl> -2.634, -4.969, -3.432, -3.778, -4.672, -5.3…
## $ mode                     <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0,…
## $ speechiness              <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.0359, 0.12…
## $ acousticness             <dbl> 0.10200, 0.07240, 0.07940, 0.02870, 0.08030,…
## $ instrumentalness         <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43e-06, 0.00…
## $ liveness                 <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.0833, 0.14…
## $ valence                  <dbl> 0.518, 0.693, 0.613, 0.277, 0.725, 0.585, 0.…
## $ tempo                    <dbl> 122.036, 99.972, 124.008, 121.956, 123.976, …
## $ duration_ms              <dbl> 194754, 162600, 176616, 169093, 189052, 1630…

Before changing any data types, we have five missing values each in track_name, track_artist, and track_album_name.

colSums(is.na(spotify_songs))

##                 track_id               track_name             track_artist 
##                        0                        5                        5 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        5 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0

In the next section, we will do more investigation and decide how to handle these missing values.

Data Cleaning

Variable Type Conversion

To begin cleaning the data, we decided to change the following character variables to factors to better understand the data for now: track_name, track_artist, track_album_name, track_album_release_date, playlist_genre, playlist_subgenre, playlist_name.

spotify_songs$track_name <- as.factor(spotify_songs$track_name)
spotify_songs$track_artist <- as.factor(spotify_songs$track_artist)
spotify_songs$track_album_name <- as.factor(spotify_songs$track_album_name)
spotify_songs$track_album_release_date <- as.factor(spotify_songs$track_album_release_date)
spotify_songs$playlist_genre <- as.factor(spotify_songs$playlist_genre)
spotify_songs$playlist_subgenre <- as.factor(spotify_songs$playlist_subgenre)
spotify_songs$playlist_name <- as.factor(spotify_songs$playlist_name)
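The same conversion can be written more compactly. The sketch below is one possible alternative, not the code used for this report: it runs on a small made-up data frame (`toy_songs` and `factor_cols` are hypothetical names) and uses dplyr's `across()` to convert several character columns in one step.

```r
library(dplyr)

# Hypothetical toy data standing in for spotify_songs
toy_songs <- data.frame(
  track_id       = c("a1", "b2", "c3"),
  track_name     = c("Song A", "Song B", "Song C"),
  playlist_genre = c("pop", "rock", "pop"),
  energy         = c(0.9, 0.5, 0.7),
  stringsAsFactors = FALSE
)

# Columns to convert; the ID column is deliberately left as character
factor_cols <- c("track_name", "playlist_genre")

# Apply as.factor() to every listed column in a single call
toy_songs <- toy_songs %>%
  mutate(across(all_of(factor_cols), as.factor))

levels(toy_songs$playlist_genre)
```

Listing the columns in one vector makes it easier to see, and later to change, which variables are treated as categorical.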

Dealing with NAs

What is the story with these tracks with NAs?

spotify_songs[rowSums(is.na(spotify_songs)) != 0,]

Five rows lack track_name, track_artist, and track_album_name, and we will delete these from the data set.

# Keep only rows that have a track name; this removes the five NA rows
spotify_songs %>% 
  filter(!is.na(track_name)) -> spotify_songs
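As a brief illustration of how missing values behave inside `filter()`, the sketch below uses a made-up three-row data frame (`toy_songs` is hypothetical). A comparison with NA returns NA, and `filter()` keeps only rows where the condition is TRUE, so a comparison-based filter drops NA rows implicitly, while an `is.na()` test states the intent directly.

```r
library(dplyr)

# Hypothetical toy data: one row has a missing track name
toy_songs <- data.frame(
  track_id   = c("a1", "b2", "c3"),
  track_name = c("Song A", NA, "Song C"),
  stringsAsFactors = FALSE
)

# Comparing with NA yields NA rather than TRUE or FALSE
toy_songs$track_name != " "

# The NA row is dropped as a side effect of the comparison
implicit <- toy_songs %>% filter(track_name != " ")

# Testing for missingness directly makes the purpose explicit
explicit <- toy_songs %>% filter(!is.na(track_name))

# Both approaches keep the same two rows here
nrow(implicit)
nrow(explicit)
```

The two filters agree on this toy data, but only the second would still be correct if a track name happened to be a single space.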

Quality check:

colSums(is.na(spotify_songs))

##                 track_id               track_name             track_artist 
##                        0                        0                        0 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        0 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0

Now we have cleaned our data set of missing values.

Numerical and Visual Summaries

The duplicates of songs make sense because different playlists may contain the same song. The output below shows that there are fewer unique songs than our total observations.

Number of unique track IDs (which gives the number of unique songs):

spotify_songs %>% 
  distinct(track_id) %>% 
  tally()

This shows us concisely which artists are in this data and how many times they appear.

spotify_songs %>% 
  group_by(track_artist) %>% 
  summarise(count = n()) %>% 
  arrange(-count) %>% 
  head()

We observe that there are 10,692 artists, based on the output of unique(spotify_songs$track_artist).

The album names are displayed here.

spotify_songs %>% 
  group_by(track_album_name) %>% 
  summarise(count = n()) %>% 
  arrange(-count) %>% 
  head()

While it may seem logical to convert track_album_release_date to a date variable, we will need to treat this carefully because the dates are recorded inconsistently. Below we see that some of these dates are just years, as opposed to the full year-month-day format.

summary(spotify_songs$track_album_release_date)

## 2020-01-10 2019-11-22 2019-12-06 2019-12-13 2013-01-01 2019-11-15 2012-01-01 
##        270        244        235        220        219        215        209 
## 2010-01-01 2019-11-08 2008-01-01 2019-10-25 2019-11-01 2019-12-20 2006-01-01 
##        192        192        188        188        180        180        170 
## 2019-10-18 2019-11-29 2019-10-11 2005-01-01 2019-10-04 2019-09-27 2019-06-28 
##        168        168        164        163        159        156        154 
## 2009-01-01 2014-01-01 2007-01-01 2019-09-06 2019-09-13 2019-06-21 2019-09-20 
##        147        147        146        140        139        138        135 
## 2019-08-16 2020-01-03 2011-01-01 2020-01-17 2019-08-30 2004-01-01 2019-12-27 
##        133        133        132        131        130        116        115 
## 2019-08-23 2019-07-26 2019-05-10 2019-05-17 2019-07-12 2003-01-01 2019-07-19 
##        114        110        109        109        107        104        101 
## 2002-01-01 2019-04-26 2019-05-31 2019-08-09 2019-05-24 2019-04-05 2019-06-14 
##         95         95         95         95         92         90         90 
## 2019-07-05 2019-05-03 2001-01-01 2019-06-07 2019-03-22 2018-12-14 2019-01-18 
##         90         88         87         87         86         85         85 
## 2019-02-22 2019-03-29 1998-01-01       2005 2019-02-08 2019-08-02 1999-01-01 
##         85         84         80         80         79         79         78 
## 2019-03-01 2019-03-08 2019-02-01 2017-06-09 2018-04-06 2018-10-05 2018-11-09 
##         76         76         74         71         71         71         70 
## 2000-01-01 2019-10-31 2018-10-19 2019-04-12 2019-01-25       1998 2016-10-21 
##         69         69         68         67         66         63         63 
## 2018-11-30       1976 1988-01-01 2018-08-17 2019-01-11 2018-11-16       2004 
##         62         59         59         59         59         58         57 
## 2018-11-02 2019-02-15 2019-04-19       2006 2015-08-28 2018-09-28 2018-12-07 
##         57         57         57         56         56         56         56 
##       2001 2018-10-26 1987-01-01 2016-06-24 2018-04-27       2003 2016-05-06 
##         55         55         53         53         53         52         52 
## 2018-08-24    (Other) 
##         52      22126
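One possible way to handle the mixed precision is sketched below in base R. It is an illustration under stated assumptions, not the conversion used in this report: `release_raw` is a made-up sample, and the year-month form ("2012-03") is an assumed third format that does not appear in the output above.

```r
# Hypothetical sample of release-date strings with mixed precision
release_raw <- c("2019-06-14", "2005", "1998-01-01", "2012-03")

# The first four characters are the year, whatever the precision
release_year <- as.integer(substr(release_raw, 1, 4))

# Pad partial dates to the first day of the period so as.Date() can parse them
padded <- ifelse(nchar(release_raw) == 4, paste0(release_raw, "-01-01"),
          ifelse(nchar(release_raw) == 7, paste0(release_raw, "-01"),
                 release_raw))
release_date <- as.Date(padded, format = "%Y-%m-%d")

data.frame(release_raw, release_year, release_date)
```

Padding makes a year-only entry indistinguishable from a genuine January 1 release, so the extracted year may be the safer variable for any analysis of popularity over time.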

There appear to be strange characters in some playlist names in the R Console. However, when we use View(summary(spotify_songs$playlist_name)), we see that these strange characters are emojis in the playlist titles. So, although this seemed abnormal, we don’t need to change these.

spotify_songs %>% 
  group_by(playlist_name) %>% 
  summarise(count = n()) %>% 
  arrange(-count) %>% 
  head()

Playlist Genre

As described in the data background, there are six genres in this data.

summary(spotify_songs$playlist_genre)

##   edm latin   pop   r&b   rap  rock 
##  6043  5153  5507  5431  5743  4951

In the relationships with playlist_genre explored through the boxplots below, the plot of danceability suggests the presence of outliers.

#plot(spotify_songs$playlist_genre)
boxplot(danceability ~ playlist_genre , spotify_songs)

This is a boxplot of valence with playlist genre.

boxplot(valence ~ playlist_genre , spotify_songs)

This is a boxplot of key with playlist genre.

boxplot(key ~ playlist_genre , spotify_songs)

This is a boxplot of mode and playlist genre.

boxplot(mode ~ playlist_genre , spotify_songs)

This is a boxplot of track popularity and playlist genre.

boxplot(track_popularity ~ playlist_genre , spotify_songs)

Playlist Subgenre

There are 24 playlist subgenres.

unique(spotify_songs$playlist_subgenre) #24 Levels

##  [1] dance pop                 post-teen pop            
##  [3] electropop                indie poptimism          
##  [5] hip hop                   southern hip hop         
##  [7] gangster rap              trap                     
##  [9] album rock                classic rock             
## [11] permanent wave            hard rock                
## [13] tropical                  latin pop                
## [15] reggaeton                 latin hip hop            
## [17] urban contemporary        hip pop                  
## [19] new jack swing            neo soul                 
## [21] electro house             big room                 
## [23] pop edm                   progressive electro house
## 24 Levels: album rock big room classic rock dance pop ... urban contemporary

If we plan to do more exploration into subgenre, we need to determine a better way to label the x-axis so we know which subgenre is being referred to more specifically. While the following boxplots reveal some interesting outliers, the unclear labelling of the axis makes it hard to see which subgenre in particular has the outliers.

There are some interesting outliers between danceability and playlist subgenre.

#plot(spotify_songs$playlist_subgenre)
boxplot(danceability ~ playlist_subgenre , spotify_songs)

In addition there are interesting outliers with valence.

boxplot(valence ~ playlist_subgenre , spotify_songs)

No outliers are apparent between playlist subgenre and key.

boxplot(key ~ playlist_subgenre , spotify_songs)

There appears to be an interesting outlier that could be investigated further between playlist subgenre and mode.

boxplot(mode ~ playlist_subgenre , spotify_songs)

There appear to be outliers present here.

boxplot(track_popularity ~ playlist_subgenre , spotify_songs)
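One way to make the subgenre labels on the boxplots above legible is to rotate them and widen the bottom margin. The sketch below uses base graphics on simulated data (`toy` is hypothetical and its values are random, not Spotify measurements); the margin and text sizes are assumptions that would need tuning for all 24 subgenres.

```r
# Simulated toy data with long subgenre names; values are random
set.seed(1)
toy <- data.frame(
  playlist_subgenre = rep(c("progressive electro house",
                            "urban contemporary",
                            "southern hip hop"), each = 20),
  danceability = runif(60)
)

# Widen the bottom margin, then rotate (las = 2) and shrink (cex.axis) labels
old_par <- par(mar = c(11, 4, 2, 1))
bp <- boxplot(danceability ~ playlist_subgenre, data = toy,
              las = 2, cex.axis = 0.8, xlab = "")
par(old_par)

# boxplot() also returns the points drawn beyond the whiskers, by group
data.frame(subgenre = bp$names[bp$group], value = bp$out)
```

The value returned by `boxplot()` pairs each outlier with its group, which gives a tabular way to see which subgenre an outlier belongs to even when the axis is crowded.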

Playlist Name vs Playlist ID

When grouped by playlist_name, there are 449 playlists based on the output of unique(spotify_songs$playlist_name).

However, when grouped by playlist_id, there are 471 playlists, based on the output of unique(spotify_songs$playlist_id). This means that some distinct playlists share the same name.
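If the mismatch between names and IDs ever needed investigating, a sketch along the following lines could list the names attached to more than one playlist. It runs on made-up data (`toy` and `shared_names` are hypothetical) and is not part of the cleaning performed here.

```r
library(dplyr)

# Hypothetical toy data: two different playlists share the name "Chill"
toy <- data.frame(
  playlist_name = c("Chill", "Chill", "Chill", "Workout"),
  playlist_id   = c("id1",   "id1",   "id2",   "id3"),
  stringsAsFactors = FALSE
)

# Count distinct IDs per name, then keep names used by more than one playlist
shared_names <- toy %>%
  group_by(playlist_name) %>%
  summarise(n_ids = n_distinct(playlist_id), .groups = "drop") %>%
  filter(n_ids > 1)

shared_names
```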

We will not do further cleaning on playlists here because playlists are not necessarily what we are concerned about in the problems we plan to address with this data.

Final Data Set

Here is a glimpse of our final data in the most condensed form possible.

glimpse(spotify_songs)

## Rows: 32,828
## Columns: 23
## $ track_id                 <chr> "6f807x0ima9a1j3VPbc7VN", "0r7CVbZTWZgbTCYdf…
## $ track_name               <fct> I Don't Care (with Justin Bieber) - Loud Lux…
## $ track_artist             <fct> Ed Sheeran, Maroon 5, Zara Larsson, The Chai…
## $ track_popularity         <dbl> 66, 67, 70, 60, 69, 67, 62, 69, 68, 67, 58, …
## $ track_album_id           <chr> "2oCs0DGTsRO98Gh5ZSl2Cx", "63rPSO264uRjW1X5E…
## $ track_album_name         <fct> I Don't Care (with Justin Bieber) [Loud Luxu…
## $ track_album_release_date <fct> 2019-06-14, 2019-12-13, 2019-07-05, 2019-07-…
## $ playlist_name            <fct> Pop Remix, Pop Remix, Pop Remix, Pop Remix, …
## $ playlist_id              <chr> "37i9dQZF1DXcZDD7cfEKhW", "37i9dQZF1DXcZDD7c…
## $ playlist_genre           <fct> pop, pop, pop, pop, pop, pop, pop, pop, pop,…
## $ playlist_subgenre        <fct> dance pop, dance pop, dance pop, dance pop, …
## $ danceability             <dbl> 0.748, 0.726, 0.675, 0.718, 0.650, 0.675, 0.…
## $ energy                   <dbl> 0.916, 0.815, 0.931, 0.930, 0.833, 0.919, 0.…
## $ key                      <dbl> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, 8, 1, 5, 5…
## $ loudness                 <dbl> -2.634, -4.969, -3.432, -3.778, -4.672, -5.3…
## $ mode                     <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0,…
## $ speechiness              <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.0359, 0.12…
## $ acousticness             <dbl> 0.10200, 0.07240, 0.07940, 0.02870, 0.08030,…
## $ instrumentalness         <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43e-06, 0.00…
## $ liveness                 <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.0833, 0.14…
## $ valence                  <dbl> 0.518, 0.693, 0.613, 0.277, 0.725, 0.585, 0.…
## $ tempo                    <dbl> 122.036, 99.972, 124.008, 121.956, 123.976, …
## $ duration_ms              <dbl> 194754, 162600, 176616, 169093, 189052, 1630…

Data Summary

Of the 23 original variables, we plan on keeping 11 of them: track_id, track_name, track_artist, track_album_id, track_album_name, valence, energy, mode, loudness, track_popularity, track_album_release_date.

Among those, the variables of particular interest to us are valence, energy, mode, loudness, and track_popularity.

While there may be insights that could come from playlist genre or subgenre, they are not of interest to us here and would not answer our initial questions as directly as the other song metrics available to us through the Spotify data. So, we will drop those variables in our analysis.

One area of concern is using popularity as the response variable in a linear regression on release date, since the dates are recorded inconsistently. Another aspect of the data we need to be careful with is mode, since it is a binary variable.

Variables of Concern

The original data frame has multiple observations of many songs because many playlists can contain the same song, as we can see here. Of the rows in our original data set, approximately 5000 are duplicates.

spotify_songs %>% distinct(track_id, .keep_all = TRUE)

For our analysis, we want to make sure we have only 1 observation of each song. However, we will leave variations of each song in the data set as the variables of interest, like popularity, do change across the variations.

In order to have only 1 observation of each song we need to create a subset of the original data frame in this way:

spotify_songs %>% distinct(track_id, .keep_all = TRUE) -> tib_single_songs
summary(tib_single_songs)
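The relationship between repeated rows and the deduplicated subset can be checked on a small example. The sketch below uses made-up data (`toy`, `n_repeats`, and `single` are hypothetical names) to show how `duplicated()` counts the repeats and which row `distinct()` keeps.

```r
library(dplyr)

# Hypothetical toy data: track "a1" appears in two playlists
toy <- data.frame(
  track_id      = c("a1", "b2", "a1", "c3"),
  playlist_name = c("Pop Remix", "Pop Remix", "Dance Pop", "Rock Hits"),
  stringsAsFactors = FALSE
)

# Rows whose track_id has already appeared earlier in the data
n_repeats <- sum(duplicated(toy$track_id))

# distinct() keeps the first row seen for each track_id
single <- toy %>% distinct(track_id, .keep_all = TRUE)

# The deduplicated row count equals the total minus the repeats
nrow(single) == nrow(toy) - n_repeats
```

Because only the first occurrence is kept, the playlist columns in the subset reflect whichever playlist happened to come first, which is one more reason to leave them out of the analysis.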

Numeric Variables of Concern

Finally, the numeric variables we are most interested in exploring are summarized here, taken from the summary output of the subset we created in the code directly above this section.

Variable	Q₁	Mean	Q₃	Comment
track_popularity	21	39.33	58	Response Variable
energy	0.579	0.698	0.843	Explanatory Variable 1
loudness	-8.309	-6.818	-4.709	Explanatory Variable 2
valence	0.329	0.510	0.695	Explanatory Variable 3
mode	0	0.566	1	Explanatory Variable 4, Binary Variable

Proposed Exploratory Data Analysis

Moving forward, we will split our columns of interest into two data frames in order to find new information that is not readily apparent. Both data frames will contain song and album names and IDs in addition to popularity. The difference will be that

one data frame contains the song metrics that are of interest to us, and
one will have release dates.

We are considering creating a new variable that is a composite of the explanatory variables. For instance, if loudness and valence could be combined maybe that would be interesting to explore further.

In order to illustrate our findings, we hope to create plots and tables using ggplot, which we will be learning soon. We would also like to find a type of plot that properly represents our binary variable mode.

We look forward to learning about the following topics to be able to answer our questions:

How to see outliers in such a large data set. There is a lot of information to fit onto a small plot, and it will be hard to choose views of the data that give insight into our problem.
The usage and syntax of summarise(). While we attained the output we desired, there was a message we suppressed.
Whether we will need the release date to refine our analysis, because there are duplicate songs with different versions. We anticipate that we may need to choose the earliest release of a song.
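On the summarise() item in the list above: assuming the suppressed message was the grouping notice that dplyr 1.0 and later can print after `summarise()` (the report does not say which message it was), the sketch below shows two ways to avoid it instead of suppressing it. The data and object names are hypothetical.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  track_artist = c("Artist A", "Artist B", "Artist A"),
  stringsAsFactors = FALSE
)

# Stating how the result should be grouped avoids the informational message
by_summarise <- toy %>%
  group_by(track_artist) %>%
  summarise(count = n(), .groups = "drop") %>%
  arrange(desc(count))

# count() is shorthand for the same group-and-tally pattern
by_count <- toy %>% count(track_artist, sort = TRUE, name = "count")

by_summarise
by_count
```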

We currently plan on fitting a linear regression of track_popularity, our response variable, on our song metrics of interest. When we do regression, we won’t use the playlist variables. We may also look at the five most and five least popular songs.
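A minimal sketch of the planned model is shown below, assuming a plain `lm()` fit with mode treated as a factor. The data are simulated purely so that the code runs (`toy` and `fit` are hypothetical, and the random values say nothing about the real Spotify data), so the fitted coefficients are meaningless.

```r
# Simulated data for illustration only; not values from the Spotify data set
set.seed(42)
n <- 200
toy <- data.frame(
  energy   = runif(n),
  loudness = runif(n, min = -20, max = 0),
  valence  = runif(n),
  mode     = sample(c(0, 1), n, replace = TRUE)
)
toy$track_popularity <- round(runif(n, min = 0, max = 100))

# Label the binary mode so its coefficient reads as a major-versus-minor contrast
toy$mode <- factor(toy$mode, levels = c(0, 1), labels = c("minor", "major"))

# Popularity as the response, the four song metrics as explanatory variables
fit <- lm(track_popularity ~ energy + loudness + valence + mode, data = toy)
summary(fit)
```

With real data, the same call would be run on the deduplicated subset so that each song contributes one observation.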

Overall, we are not sure where this analysis will lead us, but the variables we choose to explore more deeply will be revealing regardless of whether any relationships exist.

¹ https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-21/readme.md