Can We Explain Song Popularity?

1. Introduction

1.1. Project Objectives

Analyze the Spotify database to:

Understand how song characteristics (e.g. danceability, liveness) might be associated with different song genres (e.g. pop, rock).
Create a model to predict song popularity based on song characteristics.

1.2. Plan to Deliver Against Project Objectives

Conduct preliminary data analyses to determine what cleaning steps, if any, are needed
Clean the data
Determine what variables might be correlated
Fit various types of models to the data, starting with the simplest models (e.g. multiple linear regression, trees)
Select best model
Re-evaluate variables to determine whether further data cleaning and/or collection should be recommended
Final summary and recommendations

1.3. Analysis and Modeling Proposal

This project will be executed in two major phases:

Phase 1: Analyze the data and look for associations between song characteristics and song genres & sub-genres. This will include data clean-up, data wrangling and data visualization.
Phase 2: Create models to predict song popularity based on most relevant song characteristics identified in phase 1. This phase will include variable selection and evaluation of various model architectures (to be delivered on 8/13/21)

1.4. Expected Output

Learnings from these analyses and song popularity models will be used by the MakeYourSong (made up) start-up to guide its users on what song characteristics are likely to drive popularity. The predictive model will be available to users of the MakeYourSong start-up.

2. Packages Required

2.1. Packages Required

2.2. Messages and warnings resulting from loading the packages are suppressed

#install.packages("tidyverse")
#install.packages("dplyr")
#install.packages("ggplot2")
#install.packages("plotly")
#install.packages("corrplot")

library(tidyverse)
library(dplyr)
library(ggplot2)
library(plotly)
library(corrplot)
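The heading above says that load-time messages and warnings are suppressed, but the code shown does not do so on its own. A minimal sketch of one way to get that effect is given here. The helper name `load_quietly` is hypothetical, and the base packages `stats` and `utils` stand in for the project packages so that the snippet runs without extra installs.

```r
# Hypothetical helper (not part of the original report): attach a package
# without printing its start-up messages or warnings.
load_quietly <- function(pkg) {
  suppressWarnings(
    suppressPackageStartupMessages(
      library(pkg, character.only = TRUE, logical.return = TRUE)
    )
  )
}

# Stand-in package names; for this report they would be
# c("tidyverse", "plotly", "corrplot").
pkgs <- c("stats", "utils")

# TRUE for every package that was attached successfully
loaded <- vapply(pkgs, load_quietly, logical(1))
loaded
```

In an R Markdown document, setting the chunk options `message = FALSE` and `warning = FALSE` on the loading chunk is another common way to keep this output out of the rendered report.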

2.3. Package Short Description

tidyverse - for interacting with data through subsetting, transformation, visualization, etc.

dplyr - for data manipulation in R by combining, selecting, grouping, subsetting and transforming all or parts of dataset

ggplot2 - for declaratively creating graphics, based on The Grammar of Graphics

plotly - for creating interactive web-based graphs via the open source JavaScript graphing library plotly.js

corrplot - for visualizing correlation matrices and confidence intervals

3. Data Preparation

3.1. Data Source

The dataset is available on GitHub. Link to the data source is here.

3.2. Explanation of Data Source

The data to be analyzed is an excerpt of the Spotify database containing 32,833 songs released between 1957 and 2020, described by 23 variables. There are 10,693 artists and 6 main genres, each with its own sub-genres. Each track has 12 audio features, including confidence measures like acousticness, liveness, speechiness and instrumentalness, perceptual measures like energy, loudness, danceability and valence (positiveness), and descriptors like duration, tempo, key, and mode.

Genres were selected from Every Noise, a visualization of the Spotify genre-space maintained by a genre taxonomist. The top four sub-genres for each were used to query Spotify for 20 playlists each, resulting in about 5000 songs for each genre, split across a varied sub-genre space.

You can find the code for generating the dataset in spotify_dataset.R in the full GitHub repo.

3.3. Data Importing and Cleaning

# Code to import the data
spotify <- read.csv("C:/Users/king.nm/OneDrive - Procter and Gamble/UC/MS Business Analytics/Classes/Summer 2021/Data Wrangling BANA 7025 - Jun2021/Final Project/spotify.csv")

Identifying and reviewing the codebook

dictionary_spotify <- read.csv("C:/Users/king.nm/OneDrive - Procter and Gamble/UC/MS Business Analytics/Classes/Summer 2021/Data Wrangling BANA 7025 - Jun2021/Final Project/dictionary_spotify.csv")
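Both import calls above use an absolute path that exists on one machine only. As a sketch of a more portable pattern, the path can be built with `file.path()` from a single folder variable. The name `data_dir`, the temporary folder and the two-row stand-in file are assumptions made so that the example runs anywhere; they are not the project's real data.

```r
# Assumption: both CSV files live in one project folder. A temporary folder and
# a tiny stand-in file are used here so the sketch runs on any machine.
data_dir <- tempdir()
write.csv(
  data.frame(track_id = c("a1", "b2"), track_popularity = c(66, 12)),
  file.path(data_dir, "spotify.csv"),
  row.names = FALSE
)

# Build the path from its parts instead of hard-coding one long string
spotify_path <- file.path(data_dir, "spotify.csv")

# Fail early with a clear error if the file is not where it is expected
stopifnot(file.exists(spotify_path))

spotify_demo <- read.csv(spotify_path, stringsAsFactors = FALSE)
dim(spotify_demo)
```

With this layout only `data_dir` needs to change when the project moves to another computer.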

# Code to view the Spotify codebook
# Use the knitr library to format the codebook table
library(knitr)

## Warning: package 'knitr' was built under R version 4.0.5

kable(dictionary_spotify, caption = "Spotify Codebook")

Spotify Codebook
variable	class	description
track_id	character	Song unique ID
track_name	character	Song Name
track_artist	character	Song Artist
track_popularity	double	Song Popularity (0-100) where higher is better
track_album_id	character	Album unique ID
track_album_name	character	Song album name
track_album_release_date	character	Date when album released
playlist_name	character	Name of playlist
playlist_id	character	Playlist ID
playlist_genre	character	Playlist genre
playlist_subgenre	character	Playlist subgenre
danceability	double	Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy	double	Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
key	double	The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
loudness	double	The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
mode	double	Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness	double	Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
acousticness	double	A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
instrumentalness	double	Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness	double	Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence	double	A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo	double	The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
duration_ms	double	Duration of song in milliseconds

Assessing dimensions of the dataset

dim(spotify)

## [1] 28352    43

Check for duplicate rows or columns

# Checking to see whether there are songs with the same ID
length(unique(spotify$track_id))

## [1] 28352

# Creating a new data frame that keeps only the first row for each track_id
spotify_unique <- spotify[!duplicated(spotify$track_id), ]
str(spotify_unique)

# Shortening the name of spotify_unique to spotify only
spotify <- spotify_unique

# Checking whether the de-duplicated data frame contains only the 28352 unique track IDs counted above
str(spotify)
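The de-duplication step above relies on `duplicated()`, which flags every repeat of a value after its first occurrence. The sketch below shows that behavior on a made-up four-row data frame (the IDs and genres are illustrative only).

```r
# Toy data: track "a1" appears in two playlists with different genres
songs <- data.frame(
  track_id = c("a1", "b2", "a1", "c3"),
  playlist_genre = c("pop", "rock", "edm", "rap"),
  stringsAsFactors = FALSE
)

# Number of rows that repeat an earlier track_id
n_dupes <- sum(duplicated(songs$track_id))

# Keep only the first row seen for each track_id
songs_unique <- songs[!duplicated(songs$track_id), ]
songs_unique
```

One consequence worth keeping in mind for the genre analysis: a song that sits in several playlists keeps only the genre of its first row, so the row order of the input file influences which genre it is assigned.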

Viewing the head and tail of the data

head(spotify, n=5)
tail(spotify, n=5)

Cleaning the Data (explanation of the data cleaning steps)

Identify missing data
Determine how to handle missing data
Looking for outliers
Determine how to handle outliers
Frequency distribution for the variables

Identifying missing data

sum(is.na(spotify))
colSums(is.na(spotify))

# Eliminating missing data since there are not too many missing values
spotify <- na.omit(spotify)

# Checking whether missing data was omitted
str(spotify)
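The decision to drop incomplete rows rests on there being few of them. A small sketch of how that share could be quantified before calling `na.omit()` follows; the data frame is made up for illustration.

```r
# Toy data with one missing track name and one missing tempo
songs <- data.frame(
  track_name = c("A", NA, "C", "D"),
  tempo = c(120, 98, NA, 140),
  energy = c(0.8, 0.5, 0.6, 0.9),
  stringsAsFactors = FALSE
)

# Missing values per column
na_per_column <- colSums(is.na(songs))

# Share of rows that would be lost by dropping incomplete rows
share_incomplete <- mean(!complete.cases(songs))

# Drop every row that has at least one missing value
songs_clean <- na.omit(songs)
nrow(songs_clean)
```

If `share_incomplete` turned out to be large, imputing values or dropping only the affected columns might be preferable to removing rows.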

Computing summary statistics for the variables

summary(spotify)

Learn about the data visually by plotting:

Histograms of numeric variables

hist(spotify$danceability)
hist(spotify$energy)
hist(spotify$loudness)
hist(spotify$speechiness)
hist(spotify$acousticness)
hist(spotify$instrumentalness)
hist(spotify$liveness)
hist(spotify$valence)
hist(spotify$tempo)
hist(spotify$key)
hist(spotify$mode)
hist(spotify$track_popularity)
hist(spotify$duration_ms)

Tables for character variables

library(knitr)
kable(table(spotify$playlist_genre), align = "l", caption = "Playlist genre frequencies")

Playlist genre frequencies
Var1	Freq
edm	4877
latin	4136
pop	5132
r&b	4504
rap	5398
rock	4305

kable(table(spotify$playlist_subgenre),align = "l", caption = "Playlist sub-genre frequencies" )

Playlist sub-genre frequencies
Var1	Freq
album rock	1039
big room	1034
classic rock	1100
dance pop	1298
electro house	1416
electropop	1251
gangster rap	1314
hard rock	1202
hip hop	1296
hip pop	803
indie poptimism	1547
latin hip hop	1194
latin pop	1097
neo soul	1478
new jack swing	1036
permanent wave	964
pop edm	967
post-teen pop	1036
progressive electro house	1460
reggaeton	687
southern hip hop	1582
trap	1206
tropical	1158
urban contemporary	1187

Bar Plots for Song Genre

barplot(table(spotify$playlist_genre))

Box plots – looking for outliers

boxplot(spotify$track_popularity,xlab = "popularity")
boxplot(spotify$danceability,xlab = "danceability")
boxplot(spotify$duration_ms, xlab = "duration_ms")
boxplot(spotify$energy, xlab = "energy")
boxplot(spotify$loudness, xlab = "loudness")
boxplot(spotify$speechiness, xlab = "speechiness")
boxplot(spotify$acousticness, xlab = "acousticness")
boxplot(spotify$instrumentalness, xlab = "instrumentalness")
boxplot(spotify$liveness, xlab = "liveness")
boxplot(spotify$valence, xlab = "valence")
boxplot(spotify$tempo, xlab = "tempo")

The box plots show outliers for danceability, duration, energy, loudness, speechiness, acousticness, instrumentalness, liveness and tempo.
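Reading outliers off box plots can be complemented with a count per variable. The sketch below uses `boxplot.stats()`, which applies the same 1.5 x IQR whisker rule that `boxplot()` draws by default. The helper name `count_outliers` and the toy columns are assumptions for illustration.

```r
# Hypothetical helper: number of points beyond the box plot whiskers
count_outliers <- function(x) length(boxplot.stats(x)$out)

# Toy data: the first column has one extreme value, the second has none
demo <- data.frame(
  with_outlier = c(1, 2, 3, 4, 5, 100),
  without_outlier = c(10, 11, 12, 13, 14, 15)
)

# One count per column
outlier_counts <- vapply(demo, count_outliers, integer(1))
outlier_counts
```

Applied to the numeric columns of the real data, such counts would show which variables have a handful of extreme values and which have many, which may inform how outliers are handled.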

Number of Artists

length(unique(spotify$track_artist))

## [1] 10692

Number of Playlists IDs

length(unique(spotify$playlist_id))

## [1] 470

Number of Playlist Names

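# Counting distinct playlist names (the original call repeated playlist_id, so the count printed below is the playlist ID count)
length(unique(spotify$playlist_name))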

## [1] 470

Scatter plots to Look for Correlations Between Variables

plot(spotify$liveness, spotify$tempo)
plot(spotify$speechiness, spotify$liveness)
plot(spotify$liveness, spotify$track_popularity)
plot(spotify$energy, spotify$track_popularity)
plot(spotify$loudness, spotify$track_popularity)
plot(spotify$key, spotify$track_popularity)
plot(spotify$speechiness, spotify$track_popularity)

# Creating a subset of the data with numeric variables only to more easily check for correlations
library(tidyverse)
spotify_num <- select(spotify, track_popularity, danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo, key, mode)

# Checking for variable correlations
spotify_corr <- cor(spotify_num)
corrplot(spotify_corr, type = "lower", tl.srt = 20)
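A correlation plot gives the overall picture; for the modeling goal it can also help to list the correlations with track_popularity in order of strength. The sketch below does this on simulated data, in which loudness is constructed to drive popularity and energy is not. The numbers are synthetic and say nothing about the real dataset.

```r
set.seed(1)
n <- 200

# Simulated stand-in data: popularity depends on loudness only
demo <- data.frame(
  energy = runif(n),
  loudness = runif(n, min = -20, max = 0)
)
demo$track_popularity <- 40 + 1.5 * demo$loudness + rnorm(n, sd = 5)

# Full correlation matrix, as in the report
corr <- cor(demo)

# Correlations of every other variable with popularity,
# ordered from strongest to weakest in absolute value
pop_corr <- corr["track_popularity", setdiff(colnames(corr), "track_popularity")]
pop_corr <- pop_corr[order(abs(pop_corr), decreasing = TRUE)]
pop_corr
```

Because several audio features are strongly skewed, rerunning `cor()` with `method = "spearman"` could serve as a rank-based cross-check.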

Create the final clean file with numeric, genre and sub-genre variables that will be used for modeling

library(tidyverse)
spotify_m <- select(spotify, track_popularity, danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo, key, mode, playlist_genre, playlist_subgenre)

3.4. Clean Data (show the data in the most condensed form possible)

Transforming genre and sub-genre variables into factors

# Code to transform the character variables into factors
spotify_m$playlist_genre <- as.factor(spotify_m$playlist_genre)
spotify_m$playlist_subgenre <- as.factor(spotify_m$playlist_subgenre)
# Checking whether the factors were created
str(spotify_m)

# Summary of the clean dataset
summary(spotify_m)

Learning: The variables speechiness, acousticness, instrumentalness and liveness are highly skewed, with a significant number of outliers. These variables will need to be analyzed to decide whether they should be part of the analyses and predictive model.

# Summarize the clean dataset using means
# (columns are referenced by name inside summarise(); the duplicated loud_mean entry is removed)
summ1 <- summarise(spotify_m,
          popular_mean = mean(track_popularity, na.rm = TRUE),
          danceab_mean = mean(danceability, na.rm = TRUE),
          energ_mean = mean(energy, na.rm = TRUE),
          loud_mean = mean(loudness, na.rm = TRUE),
          speech_mean = mean(speechiness, na.rm = TRUE),
          acoust_mean = mean(acousticness, na.rm = TRUE),
          instr_mean = mean(instrumentalness, na.rm = TRUE),
          liven_mean = mean(liveness, na.rm = TRUE),
          valen_mean = mean(valence, na.rm = TRUE),
          tempo_mean = mean(tempo, na.rm = TRUE),
          key_mean = mean(key, na.rm = TRUE),
          mode_mean = mean(mode, na.rm = TRUE),
          n = n())

# Summarize the clean dataset using ranges
# range() returns two values, so summ2 has two rows: row 1 holds the minima, row 2 the maxima
# (popular_mean and n are single values and are repeated on both rows)
# Note: dplyr >= 1.1.0 recommends reframe() instead of summarise() for multi-row results
summ2 <- summarise(spotify_m,
          popular_mean = mean(track_popularity, na.rm = TRUE),
          danceab_range = range(danceability, na.rm = TRUE),
          energ_range = range(energy, na.rm = TRUE),
          loud_range = range(loudness, na.rm = TRUE),
          speech_range = range(speechiness, na.rm = TRUE),
          acoust_range = range(acousticness, na.rm = TRUE),
          instr_range = range(instrumentalness, na.rm = TRUE),
          liven_range = range(liveness, na.rm = TRUE),
          valen_range = range(valence, na.rm = TRUE),
          tempo_range = range(tempo, na.rm = TRUE),
          key_range = range(key, na.rm = TRUE),
          mode_range = range(mode, na.rm = TRUE),
          n = n())

# Printing the two key summary tables
print(list(summ1, summ2))

3.5 Provide summary information about the variables of concern in your cleaned data set.

As shown below, the variables speechiness, acousticness, instrumentalness and liveness are highly skewed. For instrumentalness, the median is essentially zero. For the other three variables, the median also lies much closer to the minimum than to the maximum. These variables may need to be re-scaled or eliminated from the model.

# Variables of concern
spotify_conc <- select(spotify_m, speechiness, acousticness, instrumentalness, liveness)
summary(spotify_conc)

##   speechiness      acousticness    instrumentalness       liveness     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000000   Min.   :0.0000  
##  1st Qu.:0.0410   1st Qu.:0.0143   1st Qu.:0.0000000   1st Qu.:0.0926  
##  Median :0.0626   Median :0.0797   Median :0.0000207   Median :0.1270  
##  Mean   :0.1079   Mean   :0.1772   Mean   :0.0911294   Mean   :0.1910  
##  3rd Qu.:0.1330   3rd Qu.:0.2600   3rd Qu.:0.0065725   3rd Qu.:0.2490  
##  Max.   :0.9180   Max.   :0.9940   Max.   :0.9940000   Max.   :0.9960
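To make the re-scaling option concrete, the sketch below measures skewness before and after a square-root transform and also builds a yes/no indicator using the 0.5 threshold that the codebook gives for instrumentalness. The `skewness` helper (a plain moment-based formula) and the nine toy values are assumptions for illustration; the square root is only one of several possible transforms.

```r
# Hypothetical helper: moment-based sample skewness
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Toy values shaped like instrumentalness: mostly near zero, a few large
instr_demo <- c(0, 0, 0, 0.0001, 0.001, 0.01, 0.05, 0.2, 0.9)

# Skewness of the raw values and of the square-root transformed values
skew_raw <- skewness(instr_demo)
skew_sqrt <- skewness(sqrt(instr_demo))

# Alternative: binary indicator using the codebook threshold of 0.5
is_instrumental <- instr_demo > 0.5

c(raw = skew_raw, sqrt = skew_sqrt)
```

On these toy values the transform lowers the skewness but does not remove it, which suggests that for a variable with a median near zero the binary indicator may be the more practical choice.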

4. Proposed Exploratory Data Analysis

4.1 Discuss how you plan to uncover new information in the data that is not self-evident. What are different ways you could look at this data to answer the questions you want to answer? Do you plan to slice and dice the data in different ways, create new variables, or join separate data frames to create new summary information? How could you summarize your data to answer key questions?

As part of the data analysis and modeling of this data, I looked at correlation, skewness, outliers and value frequency measures. I sliced the data into low (bottom quartile) and high (top quartile) popularity scores to try to explain song popularity.

4.2 What types of plots and tables will help you to illustrate the findings to your questions?

Correlation plots and clustered column charts helped me determine which song characteristics are more prevalent among the most popular songs.

Continuing the analysis with break-outs

spotify %>%
  group_by(playlist_genre) %>%
  summarise(
    mean_popul = mean(track_popularity, na.rm = TRUE),
    mean_liven = mean(liveness, na.rm = TRUE),
    mean_speech = mean(speechiness, na.rm = TRUE),
    mean_instr = mean(instrumentalness, na.rm = TRUE),
    mean_acoust = mean(acousticness, na.rm = TRUE),
    mean_loud = mean(loudness, na.rm = TRUE),
    mean_danc = mean(danceability, na.rm = TRUE),
    mean_energy = mean(energy, na.rm = TRUE),
    mean_valence = mean(valence, na.rm = TRUE),
    mean_durat = mean(duration_ms, na.rm = TRUE),
    mean_mode = mean(mode, na.rm = TRUE),
    mean_tempo = mean(tempo, na.rm = TRUE),
    mean_key = mean(key, na.rm = TRUE))%>%
  arrange(desc(mean_popul))

## # A tibble: 6 x 14
##   playlist_genre mean_popul mean_liven mean_speech mean_instr mean_acoust
##   <chr>               <dbl>      <dbl>       <dbl>      <dbl>       <dbl>
## 1 pop                  45.9      0.177      0.0742     0.0634      0.172 
## 2 rap                  41.8      0.191      0.197      0.0802      0.197 
## 3 latin                41.4      0.182      0.100      0.0526      0.213 
## 4 rock                 39.7      0.205      0.0579     0.0664      0.147 
## 5 r&b                  35.9      0.176      0.116      0.0285      0.264 
## 6 edm                  30.7      0.214      0.0879     0.245       0.0769
## # ... with 8 more variables: mean_loud <dbl>, mean_danc <dbl>,
## #   mean_energy <dbl>, mean_valence <dbl>, mean_durat <dbl>, mean_mode <dbl>,
## #   mean_tempo <dbl>, mean_key <dbl>

# Focusing the analysis on top quartile of popularity (goal is to determine whether the most popular songs have common characteristics)
topqtpop <-
spotify %>%
  filter(track_popularity > 58) %>%
  arrange(desc(track_popularity))
str(topqtpop)
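The quartile cutoffs in this section (58 for the top quartile, 21 for the bottom quartile) are typed in as constants. A sketch of computing them from the data with `quantile()` follows, so that the filters stay correct if the cleaned data changes. The vector `pop_demo` is a stand-in for `spotify$track_popularity`.

```r
# Stand-in popularity scores: every integer from 0 to 100
pop_demo <- 0:100

# First and third quartile of the popularity scores
cutoffs <- quantile(pop_demo, probs = c(0.25, 0.75), names = FALSE)

# Songs strictly above the third quartile and strictly below the first
top_demo <- pop_demo[pop_demo > cutoffs[2]]
bottom_demo <- pop_demo[pop_demo < cutoffs[1]]

cutoffs
```

With strict inequalities, songs that sit exactly on a cutoff fall into neither group, so the two groups can be somewhat smaller than a quarter of the data when many songs share the cutoff score.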

# Summarizing the characteristics of the most popular songs (top quartile)

t1 <- topqtpop %>%
  group_by(playlist_genre) %>%
  summarise(
    mean_popul1 = mean(track_popularity, na.rm = TRUE),
    mean_liven1 = mean(liveness, na.rm = TRUE),
    mean_speech1 = mean(speechiness, na.rm = TRUE),
    mean_instr1 = mean(instrumentalness, na.rm = TRUE),
    mean_acoust1 = mean(acousticness, na.rm = TRUE),
    mean_loud1 = mean(loudness, na.rm = TRUE),
    mean_danc1 = mean(danceability, na.rm = TRUE),
    mean_energy1 = mean(energy, na.rm = TRUE),
    mean_valence1 = mean(valence, na.rm = TRUE),
    mean_durat1 = mean(duration_ms, na.rm = TRUE),
    mean_mode1 = mean(mode, na.rm = TRUE),
    mean_tempo1 = mean(tempo, na.rm = TRUE),
    mean_key1 = mean(key, na.rm = TRUE))%>%
  arrange(desc(mean_popul1))

# Sort t1 by playlist_genre
t1 <- arrange(t1, playlist_genre)

# Focusing the analysis on bottom quartile of popularity (goal is to determine whether the least popular songs have common characteristics)
botqtpop <-
spotify %>%
  filter(track_popularity < 21) %>%
  arrange(desc(track_popularity))
str(botqtpop)

# Summarizing the characteristics of the least popular songs (bottom quartile)

b1 <- botqtpop %>%
  group_by(playlist_genre) %>%
  summarise(
    mean_popul2 = mean(track_popularity, na.rm = TRUE),
    mean_liven2 = mean(liveness, na.rm = TRUE),
    mean_speech2 = mean(speechiness, na.rm = TRUE),
    mean_instr2 = mean(instrumentalness, na.rm = TRUE),
    mean_acoust2 = mean(acousticness, na.rm = TRUE),
    mean_loud2 = mean(loudness, na.rm = TRUE),
    mean_danc2 = mean(danceability, na.rm = TRUE),
    mean_energy2 = mean(energy, na.rm = TRUE),
    mean_valence2 = mean(valence, na.rm = TRUE),
    mean_durat2 = mean(duration_ms, na.rm = TRUE),
    mean_mode2 = mean(mode, na.rm = TRUE),
    mean_tempo2 = mean(tempo, na.rm = TRUE),
    mean_key2 = mean(key, na.rm = TRUE))%>%
  arrange(desc(mean_popul2))

# Sort b1 by playlist_genre
b1 <- arrange(b1, playlist_genre)

# Calculating the difference in percent between the most and least popular songs for all song characteristics
dif_db <- 100 * (t1[-1] - b1[-1]) / t1[-1]

# Rounding the numbers of the dataset with the percent difference between most and least popular songs
#round(dif_db[-1], 0)

# Add the column 1 back
# Add the columns from the second dataframe to the first

dif_db <- cbind(dif_db, b1[1])

# Renaming the column names
# colnames(df) <- c('C1','C2','C3')
colnames(dif_db) <- c("popul", "liven", "speech", "instrum", "acoust", "loud", "dance", "energy", "valence", "durat", "mode", "tempo", "key", "genre")
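The percent difference above is computed as 100 * (top - bottom) / top. A worked sketch with made-up means shows how to read it, and also flags one case that may need care: loudness is measured in negative decibels, so dividing by the top-quartile mean flips the sign. The function names and all numbers are illustrative assumptions.

```r
# Same formula as used for dif_db
pct_diff <- function(top, bottom) 100 * (top - bottom) / top

# Variant that divides by the absolute value, so the sign always
# follows the direction of (top - bottom)
pct_diff_abs <- function(top, bottom) 100 * (top - bottom) / abs(top)

# Made-up means: acousticness is higher in the top quartile
acoust <- pct_diff(top = 0.20, bottom = 0.15)

# Made-up means: top-quartile songs are louder (-6 dB vs. -8 dB),
# yet the original formula reports a negative percentage
loud <- pct_diff(top = -6, bottom = -8)
loud_abs <- pct_diff_abs(top = -6, bottom = -8)

c(acoust = acoust, loud = loud, loud_abs = loud_abs)
```

When reading the chart, a negative bar for loudness may therefore mean that popular songs are louder, not quieter.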

Conclusion: Song popularity seems to be associated with a higher level of acousticness and a lower level of instrumentalness, as shown by the chart below.

library(tidyr)
library(ggplot2)
dat.g <- gather(dif_db[2:14], type, value, -genre)
ggplot(dat.g, aes(type, value)) + 
  geom_bar(aes(fill = genre), stat = "identity", position = "dodge") +
  geom_vline(xintercept = seq(1.5, 11.5, by = 1)) +  # 11 separators between the 12 characteristics
  theme_get() +
  scale_x_discrete(name = "Musical Characteristics") +
  scale_y_continuous(name = "%") +
  ggtitle("Comparing Most (top quartile) vs. Least (bottom quartile) Popular Songs",
          subtitle = "% variation between top vs. bottom quartile of song popularity for 12 musical characteristics")

4.3 What do you not know how to do right now that you need to learn to answer your questions?

I know that some variables are highly skewed and could lead to low-accuracy predictive models for popularity. I will take a look at this for phase 2 of this project.

4.4 Do you plan on incorporating any machine learning techniques (i.e. linear regression, discriminant analysis, cluster analysis) to answer your questions?

I plan to explore machine learning techniques such as linear regression, trees, cluster analysis and other model architectures to develop a predictive model for song popularity.
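As a rough sketch of what the simplest of these models could look like, the code below fits a multiple linear regression on simulated data and scores it on a hold-out set. The data, the coefficients and the 150/50 split are all assumptions made for illustration; this is not the model that phase 2 will deliver.

```r
set.seed(42)
n <- 200

# Simulated stand-in data with a known linear relationship plus noise
demo <- data.frame(
  danceability = runif(n),
  energy = runif(n)
)
demo$track_popularity <- 30 + 20 * demo$danceability -
  10 * demo$energy + rnorm(n, sd = 5)

# Random split: 150 rows for fitting, 50 rows held out for evaluation
train_idx <- sample(n, size = 150)
train <- demo[train_idx, ]
holdout <- demo[-train_idx, ]

# Baseline multiple linear regression
fit <- lm(track_popularity ~ danceability + energy, data = train)

# Root mean squared error on the held-out rows
pred <- predict(fit, newdata = holdout)
rmse <- sqrt(mean((holdout$track_popularity - pred)^2))

coef(fit)
rmse
```

Reporting the hold-out error for each candidate (linear regression, trees and so on) would give a common yardstick for the model selection step listed in the plan.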