Spotify Wrangling Final Report

Introduction

In our analysis, we explore which factors influence the popularity of a song, based on a Spotify data set from the TidyTuesday series. Our study may be of interest to musicians or producers who want to understand how they can make music that is more popular with their target audience on Spotify. Perhaps the factors we identify here can also inform artists on ways to reach a larger audience. Even if we don’t find relationships between these metrics and popularity, that in itself is an interesting conclusion that can inform the decisions of those who make and listen to music. As a consumer, it can sometimes be hard to pinpoint why a song is or is not enjoyable. We can help Spotify listeners identify songs similar to those they already enjoy, in a way that can improve their listening experience.

Since there are many variables in this data set, we will pinpoint a handful to explore further over the course of our cleaning and analysis. In this way, addressing our problem involves choosing some key variables to look at and not getting lost in the data. After cleaning our data with methods we learned throughout this course, we plan to explore the relationships of valence, energy, mode, and loudness with track_popularity, using linear regression and carefully constructed ggplots to conduct numerical and visual analyses. These variables are further explained in the data dictionary below.

Packages Required

The tidyverse is a collection of packages designed to simplify data analysis. A number of the functions loaded by library(tidyverse) make it easier to sort through data, look at specific variables or columns, rename or create variables, group data differently, visualize data (using ggplot) and more.

library(tidyverse)

Data Preparation

Data Source

The data used in this project was obtained from the GitHub TidyTuesday project.¹

spotify_import <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')

Data Background

This 2020 Spotify data comes from the spotifyr package, which is an R wrapper that was created by Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff to make it easier to access your own Spotify data or general data about songs from Spotify’s API.

The data set explored here was gathered by Kaylin Pavlik using audio features of the Spotify data in pursuit of exploration and classification of a collection of songs from 6 main genres (EDM, Latin, Pop, R&B, Rap, and Rock).

Data Dictionary

When initially downloaded, this data contained 23 variables. We used the following for this project:

Variable	Class	Description
track_id	character	Unique ID of a song
track_name	character	Song name
track_artist	character	Song artist
track_popularity	double	Song popularity on a scale of 0 to 100 where a higher number means more popular
playlist_name	character	Name of the playlist
playlist_id	character	Unique playlist ID
playlist_genre	character	Genre of a playlist
playlist_subgenre	character	Subgenre of a playlist
energy	double	Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
loudness	double	The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
valence	double	A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
mode	double	Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.

Preliminary Exploration and Cleaning

First, we looked at the first and last few rows in order to see what our actual data looks like. We then investigated the structure of our data set and its classifications. We learned that there are 32,833 total observations and that the Spotify data inherits the attributes of multiple classes.

head(spotify_import, 5)
tail(spotify_import, 10)
class(spotify_import)
str(spotify_import)

Looking at the summary of each variable allowed us to identify any initial abnormal values. None of the numeric variables had any apparent outliers.

summary(spotify_import)

We had five missing values each in track_name, track_artist, and track_album_name which we removed from our data set.

colSums(is.na(spotify_import))
# Drop the rows with a missing track_name (is.na() states the intent explicitly)
spotify_import %>% 
  filter(!is.na(track_name)) -> spotify_rough
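As a minimal sketch on a toy tibble (hypothetical rows, not the real data), the following compares two ways of dropping rows with a missing track_name: a comparison such as track_name != " ", which works because filter() discards rows where the condition evaluates to NA, and an explicit !is.na() check. The names toy, implicit, and explicit are illustrative only.

```r
library(dplyr)

# Toy stand-in for spotify_import (hypothetical rows, not the real data)
toy <- tibble(
  track_id   = c("a", "b", "c"),
  track_name = c("Song A", NA, "Song C")
)

# filter() keeps only rows where the condition is TRUE, so the NA produced
# by comparing a missing track_name is dropped along with any FALSE rows
implicit <- toy %>% filter(track_name != " ")

# Stating the intent explicitly keeps the same rows in this toy example
explicit <- toy %>% filter(!is.na(track_name))

# Confirm that both approaches kept the same tracks and no NAs remain
identical(implicit$track_id, explicit$track_id)
colSums(is.na(explicit))
```

The explicit form is easier to read and does not depend on how filter() treats NA comparisons.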

We removed the columns that will not be useful in our analysis. We then converted the following variables to factors to avoid confusion in our analysis: track_name, track_artist, playlist_genre, playlist_subgenre (all character) and mode (numeric). We relabeled the mode levels as “Minor” and “Major” to keep our analysis and visualizations clear. Later, in our exploratory analysis, we investigate whether the mode (major or minor) of a song has any impact on its popularity.

spotify_rough %>% 
  select(track_id, track_name, track_artist, track_popularity, playlist_id, playlist_genre, playlist_subgenre, energy, loudness, mode, valence) -> spotify_rough2

spotify_rough2$track_name <- as.factor(spotify_rough2$track_name)
spotify_rough2$track_artist <- as.factor(spotify_rough2$track_artist)
spotify_rough2$playlist_genre <- as.factor(spotify_rough2$playlist_genre)
spotify_rough2$playlist_subgenre <- as.factor(spotify_rough2$playlist_subgenre)

spotify_rough2$mode <- as.factor(spotify_rough2$mode)
spotify_rough2$mode <- fct_recode(spotify_rough2$mode, Minor = "0", Major = "1")

spotify_songs_with_playlist <- spotify_rough2

We know there are duplicates in track_name because a track can be contained in multiple playlists.

spotify_songs_with_playlist %>% 
  distinct(track_id) %>% 
  tally()

Our analysis should have only one observation per song, so we created a data set with these 26,294 distinct observations. We also removed duplicates that shared a track_name and track_artist but had different track_id values, keeping the track with the highest popularity. We will revisit double classifications in ‘playlist_genre’ and ‘playlist_subgenre’ after our primary analysis.

spotify_songs_with_playlist %>% 
  select(track_id, track_name, track_artist, track_popularity, energy, loudness, valence, mode) %>% 
  distinct() -> distinct_tracks_rough

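# For songs listed under more than one track_id, keep the most popular version.
# Note: top_n() keeps every row tied for the highest popularity within a group.
distinct_tracks_rough %>% 
  group_by(track_name, track_artist) %>% 
  arrange(desc(track_popularity)) %>% 
  top_n(1, track_popularity) -> distinct_tracks_rough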

spotify_distinct_track_ids <- distinct_tracks_rough
spotify_songs <- ungroup(spotify_distinct_track_ids)
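The one-row-per-song step can be illustrated on a small hypothetical tibble. This sketch contrasts top_n(), which retains tied rows, with slice_max(with_ties = FALSE), which returns exactly one row per group. It is an alternative shown for comparison, not the method used to produce the counts in this report, and whether ties occur in the real data is not checked here.

```r
library(dplyr)

# Hypothetical tracks: "Song A" appears under two IDs with different
# popularity, "Song B" under two IDs with identical popularity
toy_tracks <- tibble(
  track_id         = c("id1", "id2", "id3", "id4"),
  track_name       = c("Song A", "Song A", "Song B", "Song B"),
  track_artist     = c("Artist X", "Artist X", "Artist Y", "Artist Y"),
  track_popularity = c(70, 55, 40, 40)
)

# top_n() keeps all rows tied for the maximum within each group
with_ties <- toy_tracks %>%
  group_by(track_name, track_artist) %>%
  top_n(1, track_popularity) %>%
  ungroup()

# slice_max(with_ties = FALSE) keeps exactly one row per group
one_per_song <- toy_tracks %>%
  group_by(track_name, track_artist) %>%
  slice_max(track_popularity, n = 1, with_ties = FALSE) %>%
  ungroup()

nrow(with_ties)
nrow(one_per_song)
```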

The following gives us a concise look at the final data set with which we will proceed for our analysis, where we have one character variable (track_id), three factors (track_name, track_artist, mode), and four numeric variables (track_popularity, energy, loudness, valence).

glimpse(spotify_songs)

Data Summary

Of the 23 original variables, we have kept 8: track_id, track_name, track_artist, track_popularity, energy, valence, loudness and mode. The variables used in our analysis, along with their first quartiles, means, third quartiles, and roles, are listed in the following table (mode is summarized in its original 0/1 coding).

Variable	Q₁	Mean	Q₃	Comment
track_popularity	24	40.41	58	Response Variable
energy	0.577	0.697	0.843	Explanatory Variable 1
loudness	-8.339	-6.841	-4.724	Explanatory Variable 2
valence	0.325	0.510	0.690	Explanatory Variable 3
mode	0	0.566	1	Explanatory Variable 4, Binary Variable

Exploratory Data Analysis

Will energy, valence and/or loudness have a linear relationship with track popularity? Will the popular songs in this data set be in a minor key more often than in a major key?

We investigate these questions via linear regression and visual analysis in conjunction with separating, joining, and taking subsets of our data. We do not employ new variables in the questions we ask.

Initial Observations of Track Popularity

We started our analysis by looking at the distribution of track_popularity. The high number of songs (2,301) with a zero popularity rating prompted us to delve further into the calculation of track popularity. Because we lack Spotify domain knowledge, we investigated community conversation about track_popularity and realized there are inconsistencies. We decided to proceed with analysis on data sets that both included and excluded said zeros.
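The regression code later in this report refers to a data set named spotify_nozeroes whose definition is not shown. A plausible construction, sketched here on simulated data, is a simple filter on track_popularity. The filter condition, the toy data, and the names toy_songs, toy_nozeroes, and p_all are all assumptions for illustration.

```r
library(dplyr)
library(ggplot2)

# Simulated stand-in for spotify_songs: 20 zero-popularity tracks plus
# 180 tracks with popularity drawn from 1 to 100 (not the real data)
set.seed(1)
toy_songs <- tibble(
  track_popularity = c(rep(0, 20), sample(1:100, 180, replace = TRUE))
)

# Count the zero-popularity tracks
n_zero <- sum(toy_songs$track_popularity == 0)

# Assumed definition of the "no zeros" subset
toy_nozeroes <- toy_songs %>%
  filter(track_popularity != 0)

# Histogram of the full distribution; swapping in toy_nozeroes gives
# the corresponding plot without the spike at zero
p_all <- ggplot(toy_songs, aes(x = track_popularity)) +
  geom_histogram(binwidth = 5) +
  labs(x = "Track Popularity", y = "Count",
       title = "Distribution of Track Popularity (toy data)")
```

Applied to the real data, the same filter on spotify_songs would give a spotify_nozeroes data set of the kind used below.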

A left skew of track_popularity is apparent regardless of the presence of track popularity zeroes. There are relatively few songs with a popularity rating above 75 and we will investigate these songs towards the end of our report.

Minor and Major Tracks Show Similar Distributions

A boxplot faceted by mode gives us a clear understanding of how important a minor or major key is to track_popularity. We observe that the quartiles and medians are virtually the same. Thus these visualizations, both with and without the track_popularity observations equal to zero, quickly illustrate that mode does not appear to influence track_popularity in this data set.
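The plotting code for this comparison is not shown, so the following is only a sketch of one way to draw it and to check the quartiles numerically. The toy data and the names toy_songs, mode_summary, and p_mode are hypothetical, and placing mode on the x-axis is a design choice made for this example.

```r
library(dplyr)
library(ggplot2)

# Hypothetical tracks with mode already recoded to Minor/Major
toy_songs <- tibble(
  mode = factor(c("Minor", "Minor", "Minor", "Major", "Major", "Major")),
  track_popularity = c(10, 40, 70, 12, 41, 69)
)

# Quartiles and median of popularity within each mode
mode_summary <- toy_songs %>%
  group_by(mode) %>%
  summarise(
    q1     = unname(quantile(track_popularity, 0.25)),
    median = median(track_popularity),
    q3     = unname(quantile(track_popularity, 0.75)),
    .groups = "drop"
  )

# Side-by-side boxplots of popularity for minor and major tracks
p_mode <- ggplot(toy_songs, aes(x = mode, y = track_popularity, fill = mode)) +
  geom_boxplot(show.legend = FALSE) +
  labs(x = "Mode", y = "Track Popularity",
       title = "Track Popularity by Mode (toy data)")

mode_summary
```

Reading the summary table alongside the plot makes it easier to judge whether the boxes really coincide.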

Regression Analysis of Energy, Loudness and Valence on Track Popularity

After regressing track_popularity on each of energy, loudness and valence, we obtained the following simple linear regression coefficients, p-values and adjusted R-squared values:

##Energy
popularity_fit_energy <- lm(formula = track_popularity ~ energy, data = spotify_songs)
#plot(spotify_songs$energy, spotify_songs$track_popularity)
#abline(popularity_fit_energy, lwd = 2, col = "red2")
summary(popularity_fit_energy)

popularity_fit_energy_nozero <- lm(formula = track_popularity ~ energy, data = spotify_nozeroes)
#plot(spotify_nozeroes$energy, spotify_nozeroes$track_popularity)
#abline(popularity_fit_energy_nozero, lwd = 2, col = "red2")
summary(popularity_fit_energy_nozero)

##Loudness
popularity_fit_loudness <- lm(formula = track_popularity ~ loudness, data = spotify_songs)
#plot(spotify_songs$loudness, spotify_songs$track_popularity)
#abline(popularity_fit_loudness, lwd = 2, col = "red2")
summary(popularity_fit_loudness)

popularity_fit_loud_nozero <- lm(formula = track_popularity ~ loudness, data = spotify_nozeroes)
#plot(spotify_nozeroes$loudness, spotify_nozeroes$track_popularity)
#abline(popularity_fit_loud_nozero, lwd = 2, col = "red2")
summary(popularity_fit_loud_nozero)

##Valence
popularity_fit_valence <- lm(formula = track_popularity ~ valence, data = spotify_songs)
#plot(spotify_songs$valence, spotify_songs$track_popularity)
#abline(popularity_fit_valence, lwd = 2, col = "red2")
summary(popularity_fit_valence)

popularity_fit_Valence_nozero <- lm(formula = track_popularity ~ valence, data = spotify_nozeroes)
#plot(spotify_nozeroes$valence, spotify_nozeroes$track_popularity)
#abline(popularity_fit_Valence_nozero, lwd = 2, col = "red2")
summary(popularity_fit_Valence_nozero)

Regression of Predictor Variables on Track Popularity

Variable	Includes Track Pop = 0?	SLR coefficient (Beta₁)	p-value	Adj R²
energy	yes	-12.932	<2e-16	0.011
energy	no	-8.527	<2e-16	0.001
loudness	yes	0.309	5.01e-11	0.002
loudness	no	0.383	<2e-16	0.003
valence	yes	3.562	5.62e-09	0.001
valence	no	4.412	7.09e-15	0.002

We can interpret this table as follows: The variable energy has a negative relationship with track_popularity and, in this data set, we would expect track_popularity to decrease by roughly 13 points (or roughly 8.5 points when popularity zeros are excluded) as energy changes from 0 to 1. That being said, the adjusted R² is barely 0.01 when all of the track_popularity values of zero are included and even lower when they are removed. This means that barely 1% of the variance in track_popularity can be explained by energy, so energy is not a meaningful predictor of track_popularity.
The variables loudness and valence each have a positive linear relationship with track_popularity. However, these R² values are also very small, suggesting that track_popularity is not explained by either loudness or valence.

Thus, none of our four predictor variables (energy, loudness, valence, and mode) has proven able to predict a track’s popularity.
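The slope, p-value, and adjusted R² reported for each model can be read off summary() by hand or collected programmatically. The helper below, slr_metrics(), is a hypothetical name introduced for illustration, and the data are simulated rather than taken from the Spotify set.

```r
# Hypothetical helper: pull the slope, its p-value, and the adjusted R²
# from a fitted simple linear regression
slr_metrics <- function(fit) {
  s <- summary(fit)
  coefs <- coef(s)  # matrix: one row per term, columns include Estimate and Pr(>|t|)
  data.frame(
    beta1   = unname(coefs[2, "Estimate"]),
    p_value = unname(coefs[2, "Pr(>|t|)"]),
    adj_r2  = s$adj.r.squared
  )
}

# Simulated data with a weak negative slope and a lot of noise
set.seed(42)
toy <- data.frame(energy = runif(200))
toy$track_popularity <- 50 - 10 * toy$energy + rnorm(200, sd = 20)

fit <- lm(track_popularity ~ energy, data = toy)
metrics <- slr_metrics(fit)
metrics
```

A slope can be statistically significant while the adjusted R² remains small; with many observations, even a weak relationship yields a small p-value, which is the pattern seen in the table above.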

Now, we will revisit the influence of playlist genre and subgenre to see if either of those variables can help us predict a track’s popularity. We used a left join to retrieve the tracks’ respective playlists, knowing it is possible for a track to be found in more than one playlist. We then visualized these results using the boxplots below:

#summary(spotify_songs_with_playlist)
#rejoin our smaller spotify_songs with the tracks' playlists
#head(spotify_songs_with_playlist)
playlist_subset_og <- left_join(spotify_nozeroes, spotify_songs_with_playlist, by = "track_id")
playlist_subset_og %>% select(loudness.x, valence.x, mode.x, energy.x, track_id, track_name.x, track_artist.x, track_popularity.x, playlist_genre, playlist_subgenre) -> playlist_subset
dim(playlist_subset)

genre_fit <- lm(track_popularity.x ~ playlist_genre, data = playlist_subset)
summary(genre_fit)

ggplot(playlist_subset, aes(x = reorder(playlist_genre, desc(playlist_genre)), y = track_popularity.x, fill = playlist_genre)) + 
  labs(x = 'Playlist Genre', y = 'Track Popularity', title = 'Track Popularity by Playlist Genre') +
  scale_fill_brewer(palette = "YlGnBu") +
  geom_boxplot(show.legend = FALSE) + coord_flip()

All genres have tracks at varying track_popularity ratings, which was to be expected, but there does appear to be a difference in typical track_popularity across genres. However, after conducting a simple linear regression analysis, once again we find a very small R² of 0.03, meaning only 3% of the variance in track_popularity can be explained by playlist genre.

#genre_fit <- lm(track_popularity.x ~ playlist_subgenre, data = playlist_subset)
#summary(genre_fit)

ggplot(playlist_subset, aes(x = reorder(playlist_subgenre, desc(playlist_subgenre)), y = track_popularity.x, fill = playlist_genre)) + 
  labs(x = 'Playlist Subgenre', y = 'Track Popularity', title = 'Track Popularity by Playlist Subgenre') +
  scale_fill_brewer(palette = "YlGnBu") +
  geom_boxplot(show.legend = FALSE) + coord_flip()

Visualizing the tracks’ subgenre illustrates that the medians and ranges of varying subgenres can be quite different. After conducting a simple linear regression analysis with playlist_subgenre as the predictor variable, we observe an R² of 0.113, the highest we have seen in our analysis. While it is interesting to see that knowing a track’s playlist_subgenre can explain about 11% of the variance in its track_popularity, this is still not a very strong predictor. After doing this analysis, we can at least say that if you want a shot at choosing a popular song on Spotify, you can go to a post-teen pop playlist first!

playlist_subset %>% 
  filter(playlist_subgenre == 'post-teen pop') %>%
  ggplot(aes(x = track_popularity.x, y = playlist_subgenre)) +
    geom_jitter(color = "darkcyan", show.legend = FALSE, alpha = .5) +
    labs(x = 'Track Popularity', y = "Post-Teen Pop", title = 'Track Popularity of the Post-Teen Pop Subgenre') +
    theme(axis.text.y = element_blank(),
          axis.title.y=element_blank(),
          axis.ticks.y=element_blank())

#playlist_subset %>% 
#  filter(playlist_subgenre == 'post-teen pop') %>% 
#  summary()

Summary Statistics of Interest for the Post-Teen Pop Subgenre

Variable	Q₁	Mean	Median	Q₃
track_popularity	49	60	67	77
energy	0.59	0.71	0.73	0.84
loudness	-6.6	-5.8	-5.3	-4.2
valence	0.39	0.55	0.55	0.72

Post-Teen Pop by Mode: 577 major songs; 377 minor songs.

When you compare these statistics to those of the data summary, it is interesting to note the post-teen pop subgenre exhibits higher means in all three numeric predictor variables.
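Summary statistics like those in the post-teen pop table can be computed with filter(), summarise(), and count(). The sketch below uses a small hypothetical tibble, and toy_playlist, ptp, ptp_stats, and ptp_modes are illustrative names, not objects from this report.

```r
library(dplyr)

# Hypothetical playlist rows (not the real data)
toy_playlist <- tibble(
  playlist_subgenre = c("post-teen pop", "post-teen pop", "post-teen pop",
                        "post-teen pop", "trap"),
  mode              = factor(c("Major", "Major", "Minor", "Major", "Minor")),
  track_popularity  = c(40, 60, 70, 90, 20)
)

# Keep only the subgenre of interest
ptp <- toy_playlist %>%
  filter(playlist_subgenre == "post-teen pop")

# Quartiles, mean, and median of popularity within the subgenre
ptp_stats <- ptp %>%
  summarise(
    q1     = unname(quantile(track_popularity, 0.25)),
    mean   = mean(track_popularity),
    median = median(track_popularity),
    q3     = unname(quantile(track_popularity, 0.75))
  )

# Number of major and minor tracks in the subgenre
ptp_modes <- ptp %>% count(mode)

ptp_stats
ptp_modes
```

The same summarise() call can be repeated for energy, loudness, and valence to fill in the remaining rows of such a table.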

Summary

At the start, we wanted to address the problem of whether or not energy, valence and/or loudness had a linear relationship with track popularity. We also investigated if the popular songs in this data set were in a minor key more often than in a major key. In both of these pursuits, we concluded that track popularity is not heavily influenced by any of those factors (valence, energy, mode, or loudness). Then, we further investigated whether or not genre affected popularity, and concluded that it does to a small extent.

We addressed our original problem statement by interpreting linear regression metrics, which showed that energy, loudness, and valence did not strongly impact track popularity in our data set. Using boxplots, we determined that the quartiles and medians of track popularity did not noticeably change based on the mode of a song. So, we concluded that mode does not impact track popularity in this data set. From there, we did a visual analysis of boxplots of playlist genre and subgenre with track popularity. Upon examination of our subgenre boxplot, we noticed that the post-teen pop subgenre had the highest track popularity median. We concluded our study with a summary of the metrics of interest from the most popular subgenre, post-teen pop.

Among our target audience, we can recommend post-teen pop music to someone looking for popular music that may be new to them. As for members of the music industry looking to make popular music, they may be interested in recreating music that fits the metrics of a post-teen pop song. (Although it may not be groundbreaking that pop music is popular, it is still a valid insight that is supported by our data.)

We realize that while we were able to make valid conclusions here, our investigation of the metrics we were originally interested in was hindered by the limitations of this data set. The linear regression that we did was not particularly insightful, as the plots (that we did not end up including) were messy scatter plots with close to horizontal regression lines. That is, these plots did not demonstrate any strong linear relationship between track popularity and our predictor variables. This does not mean that such a relationship does not exist within a larger and more diverse sampling of Spotify data. Furthermore, our limited knowledge of how metrics like track_popularity are calculated may affect the interpretation of our analysis, depending on the correct definition of popularity in its original context. Perhaps with a deeper understanding of Spotify’s variable definitions and a sampling of data from a larger selection of genres, one could improve upon our insights and revisit these variables and their relationships to track_popularity.

¹ https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-21/readme.md