This project analyzes the Spotify database and will be executed in two major phases:
Phase 1: Analyze the data and look for associations between song characteristics and song genres & sub-genres. This will include data clean-up, data wrangling and data visualization.
Phase 2: Create models to predict song popularity based on the most relevant song characteristics identified in Phase 1. This phase will include variable selection and evaluation of various model architectures (to be delivered on 8/13/21).
Learnings from these analyses and song popularity models will be used by the MakeYourSong start-up (a fictional company) to guide its users on which song characteristics are likely to drive popularity. The predictive model will be available to MakeYourSong users.
#install.packages("tidyverse")
#install.packages("dplyr")
#install.packages("ggplot2")
#install.packages("plotly")
#install.packages("corrplot")
library(tidyverse)
library(dplyr)
library(ggplot2)
library(plotly)
library(corrplot)
tidyverse - for interacting with data through subsetting, transformation, visualization, etc.
dplyr - for data manipulation in R by combining, selecting, grouping, subsetting and transforming all or parts of dataset
ggplot2 - for declaratively creating graphics, based on The Grammar of Graphics
plotly - for creating interactive web-based graphs via the open source JavaScript graphing library plotly.js
corrplot - for visualizing correlation matrices and confidence intervals
The dataset is available on GitHub. Link to the data source is here.
The data to be analyzed is an excerpt of the Spotify database. The data set contains 23 variables and 32,833 songs released between 1957 and 2020. There are 10,693 artists and 6 main genres, each with its own sub-genres. There are 12 audio features for each track, including confidence measures like acousticness, liveness, speechiness and instrumentalness, perceptual measures like energy, loudness, danceability and valence (positiveness), and descriptors like duration, tempo, key, and mode.
Genres were selected from Every Noise, a visualization of the Spotify genre-space maintained by a genre taxonomist. The top four sub-genres for each were used to query Spotify for 20 playlists each, resulting in about 5000 songs for each genre, split across a varied sub-genre space.
You can find the code for generating the dataset in spotify_dataset.R in the full GitHub repo.
# Code to import the data
spotify <- read.csv("C:/Users/king.nm/OneDrive - Procter and Gamble/UC/MS Business Analytics/Classes/Summer 2021/Data Wrangling BANA 7025 - Jun2021/Final Project/spotify.csv")
dictionary_spotify <- read.csv("C:/Users/king.nm/OneDrive - Procter and Gamble/UC/MS Business Analytics/Classes/Summer 2021/Data Wrangling BANA 7025 - Jun2021/Final Project/dictionary_spotify.csv")
# Code to view the Spotify codebook
# Use library knitr to format codebook table
library(knitr)
## Warning: package 'knitr' was built under R version 4.0.5
kable(dictionary_spotify[,], caption = "Spotify Codebook")
| variable | class | description |
|---|---|---|
| track_id | character | Song unique ID |
| track_name | character | Song Name |
| track_artist | character | Song Artist |
| track_popularity | double | Song Popularity (0-100) where higher is better |
| track_album_id | character | Album unique ID |
| track_album_name | character | Song album name |
| track_album_release_date | character | Date when album released |
| playlist_name | character | Name of playlist |
| playlist_id | character | Playlist ID |
| playlist_genre | character | Playlist genre |
| playlist_subgenre | character | Playlist subgenre |
| danceability | double | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
| energy | double | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
| key | double | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. |
| loudness | double | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB. |
| mode | double | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. |
| speechiness | double | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
| acousticness | double | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
| instrumentalness | double | Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
| liveness | double | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
| valence | double | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
| tempo | double | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
| duration_ms | double | Duration of song in milliseconds |
dim(spotify)
## [1] 28352 43
# Checking to see whether there are songs with the same ID
length(unique(spotify$track_id))
## [1] 28352
# Creating a new file with unique songs
spotify_unique = spotify[!duplicated(spotify$track_id),]
str(spotify_unique)
# Shortening the name of spotify_unique to spotify only
spotify <- spotify_unique
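The de-duplication step above keeps the first row that R encounters for each track_id. A minimal sketch on a hypothetical toy data frame (the rows and playlist names are made up for illustration, not taken from the Spotify data) shows the behaviour, together with a dplyr equivalent:

```r
library(dplyr)

# Hypothetical toy data: track "a1" appears on two playlists
toy <- data.frame(
  track_id      = c("a1", "b2", "a1", "c3"),
  playlist_name = c("Pop Hits", "Rock Classics", "Dance Party", "Latin Mix"),
  stringsAsFactors = FALSE
)

# Base R: keep the first row seen for each track_id
toy_base <- toy[!duplicated(toy$track_id), ]

# dplyr equivalent: one row per track_id, all other columns kept
toy_dplyr <- distinct(toy, track_id, .keep_all = TRUE)

nrow(toy)                  # rows before de-duplication
nrow(toy_base)             # rows after de-duplication
n_distinct(toy$track_id)   # number of unique IDs, for comparison
```

Because only the first occurrence is kept, a song that appears on playlists from several genres retains the genre of whichever playlist comes first in the file.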
# Checking whether the de-duplicated data contains only 28356 rows
str(spotify)
head(spotify, n=5)
tail(spotify, n=5)
sum(is.na(spotify))
colSums(is.na(spotify))
# Eliminating missing data since there are not too many missing values
spotify <- na.omit(spotify)
# Checking whether missing data was omitted
str(spotify)
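As a rough illustration of what na.omit() does, the sketch below uses a hypothetical four-row data frame (not the Spotify data) to count the incomplete rows before dropping them, so the size of the loss is known in advance:

```r
# Hypothetical toy data with one missing value in each of two rows
toy <- data.frame(
  track_name = c("Song A", NA, "Song C", "Song D"),
  energy     = c(0.8, 0.5, NA, 0.6),
  stringsAsFactors = FALSE
)

# Missing values per column
colSums(is.na(toy))

# Rows with at least one missing value are the ones na.omit() removes
n_dropped <- sum(!complete.cases(toy))

toy_clean <- na.omit(toy)

n_dropped        # number of rows removed
nrow(toy_clean)  # number of rows kept
```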
summary(spotify)
hist(spotify$danceability)
hist(spotify$energy)
hist(spotify$loudness)
hist(spotify$speechiness)
hist(spotify$acousticness)
hist(spotify$instrumentalness)
hist(spotify$liveness)
hist(spotify$valence)
hist(spotify$tempo)
hist(spotify$key)
hist(spotify$mode)
hist(spotify$track_popularity)
hist(spotify$duration_ms)
library(knitr)
kable(table(spotify$playlist_genre), align = "l", caption = "Playlist genre frequencies")
| Var1 | Freq |
|---|---|
| edm | 4877 |
| latin | 4136 |
| pop | 5132 |
| r&b | 4504 |
| rap | 5398 |
| rock | 4305 |
kable(table(spotify$playlist_subgenre),align = "l", caption = "Playlist sub-genre frequencies" )
| Var1 | Freq |
|---|---|
| album rock | 1039 |
| big room | 1034 |
| classic rock | 1100 |
| dance pop | 1298 |
| electro house | 1416 |
| electropop | 1251 |
| gangster rap | 1314 |
| hard rock | 1202 |
| hip hop | 1296 |
| hip pop | 803 |
| indie poptimism | 1547 |
| latin hip hop | 1194 |
| latin pop | 1097 |
| neo soul | 1478 |
| new jack swing | 1036 |
| permanent wave | 964 |
| pop edm | 967 |
| post-teen pop | 1036 |
| progressive electro house | 1460 |
| reggaeton | 687 |
| southern hip hop | 1582 |
| trap | 1206 |
| tropical | 1158 |
| urban contemporary | 1187 |
barplot(table(spotify$playlist_genre))
boxplot(spotify$track_popularity,xlab = "popularity")
boxplot(spotify$danceability,xlab = "danceability")
boxplot(spotify$duration_ms, xlab = "duration_ms")
boxplot(spotify$energy, xlab = "energy")
boxplot(spotify$loudness, xlab = "loudness")
boxplot(spotify$speechiness, xlab = "speechiness")
boxplot(spotify$acousticness, xlab = "acousticness")
boxplot(spotify$instrumentalness, xlab = "instrumentalness")
boxplot(spotify$liveness, xlab = "liveness")
boxplot(spotify$valence, xlab = "valence")
boxplot(spotify$tempo, xlab = "tempo")
Outliers are present in danceability, duration, energy, loudness, speechiness, acousticness, instrumentalness, liveness and tempo.
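By default, boxplot() draws a point as an outlier when it lies more than 1.5 times the interquartile range beyond the box hinges. The sketch below reproduces that rule by hand on hypothetical values (a stand-in for a skewed feature such as speechiness) and compares the result with boxplot.stats():

```r
# Hypothetical values, not taken from the Spotify data
x <- c(0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.10, 0.12, 0.45, 0.90)

# Hinges (lower and upper edge of the box) as computed by fivenum()
hinges <- fivenum(x)[c(2, 4)]
iqr    <- diff(hinges)

# Fences: 1.5 * IQR below the lower hinge and above the upper hinge
lower_fence <- hinges[1] - 1.5 * iqr
upper_fence <- hinges[2] + 1.5 * iqr

# Points outside the fences are the ones drawn as outliers
out_manual <- x[x < lower_fence | x > upper_fence]

# The same points as reported by the helper that boxplot() uses
out_builtin <- boxplot.stats(x)$out

out_manual
out_builtin
```

Counting `length(boxplot.stats(x)$out)` for each feature gives a number to go with the visual impression from the boxplots.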
length(unique(spotify$track_artist))
## [1] 10692
length(unique(spotify$playlist_id))
## [1] 470
plot(spotify$liveness, spotify$tempo)
plot(spotify$speechiness, spotify$liveness)
plot(spotify$liveness, spotify$track_popularity)
plot(spotify$energy, spotify$track_popularity)
plot(spotify$loudness, spotify$track_popularity)
plot(spotify$key, spotify$track_popularity)
plot(spotify$speechiness, spotify$track_popularity)
# Creating a subset of the data with numeric variables only to more easily check for correlations
library(tidyverse)
spotify_num <- select(spotify, track_popularity, danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo, key, mode)
# Checking for variable correlations
spotify_corr <- cor(spotify_num)
corrplot(spotify_corr, type = "lower", tl.srt = 20)
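To read a correlation matrix with popularity in mind, one option is to pull out the track_popularity row and rank the other variables by absolute correlation. The sketch below does this on simulated data. The data frame, its coefficients and the choice of Spearman (rank) correlation are assumptions for illustration, not results from the Spotify data:

```r
set.seed(1)
n <- 200

# Simulated stand-ins for three audio features (hypothetical values)
toy <- data.frame(
  loudness    = rnorm(n, mean = -7, sd = 3),
  energy      = runif(n),
  speechiness = rexp(n, rate = 10)
)

# Hypothetical popularity score that depends on loudness only
toy$track_popularity <- 50 + 3 * toy$loudness + rnorm(n, sd = 5)

# Spearman correlation is rank-based, so it is less sensitive to skewed variables
corr_mat <- cor(toy, method = "spearman")

# Correlations of every feature with popularity, strongest first
pop_corr <- corr_mat["track_popularity", colnames(corr_mat) != "track_popularity"]
ranked   <- pop_corr[order(abs(pop_corr), decreasing = TRUE)]

ranked
```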
library(tidyverse)
spotify_m <- select(spotify, track_popularity, danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo, key, mode, playlist_genre, playlist_subgenre)
# Code to transform the character variables into factors
spotify_m$playlist_genre <- as.factor(spotify_m$playlist_genre)
spotify_m$playlist_subgenre <- as.factor(spotify_m$playlist_subgenre)
# Checking whether the factors were created
str(spotify_m)
# Summary of the clean dataset
summary(spotify_m)
Learning: The variables speechiness, acousticness, instrumentalness and liveness are highly skewed, with a significant number of outliers. These variables will need to be analyzed to decide whether they should be part of the analyses and predictive model.
# Summarize the clean dataset using means
# (columns are referenced by name inside summarise(); no spotify_m$ prefix is needed)
summ1 <- summarise(spotify_m,
popular_mean = mean(track_popularity, na.rm = TRUE),
danceab_mean = mean(danceability, na.rm = TRUE),
energ_mean = mean(energy, na.rm = TRUE),
loud_mean = mean(loudness, na.rm = TRUE),
speech_mean = mean(speechiness, na.rm = TRUE),
acoust_mean = mean(acousticness, na.rm = TRUE),
instr_mean = mean(instrumentalness, na.rm = TRUE),
liven_mean = mean(liveness, na.rm = TRUE),
valen_mean = mean(valence, na.rm = TRUE),
tempo_mean = mean(tempo, na.rm = TRUE),
key_mean = mean(key, na.rm = TRUE),
mode_mean = mean(mode, na.rm = TRUE),
n = n())
# Summarize the clean dataset using ranges
# range() returns two values (min, max), so summ2 has two rows: minima first, maxima second
# (dplyr >= 1.1.0 recommends reframe() instead of summarise() for multi-row results)
summ2 <- summarise(spotify_m,
popular_range = range(track_popularity, na.rm = TRUE),
danceab_range = range(danceability, na.rm = TRUE),
energ_range = range(energy, na.rm = TRUE),
loud_range = range(loudness, na.rm = TRUE),
speech_range = range(speechiness, na.rm = TRUE),
acoust_range = range(acousticness, na.rm = TRUE),
instr_range = range(instrumentalness, na.rm = TRUE),
liven_range = range(liveness, na.rm = TRUE),
valen_range = range(valence, na.rm = TRUE),
tempo_range = range(tempo, na.rm = TRUE),
key_range = range(key, na.rm = TRUE),
mode_range = range(mode, na.rm = TRUE),
n = n())
# Printing the two key summary tables
print(list(summ1, summ2))
As shown below, the variables speechiness, acousticness, instrumentalness and liveness are highly skewed. For instrumentalness, the median is essentially zero (0.0000207). For the other three variables, the median is also much closer to the minimum than to the maximum. These variables may need to be re-scaled or excluded from the model.
# Variables of concerns
spotify_conc <- select(spotify_m, speechiness, acousticness, instrumentalness, liveness)
summary(spotify_conc)
## speechiness acousticness instrumentalness liveness
## Min. :0.0000 Min. :0.0000 Min. :0.0000000 Min. :0.0000
## 1st Qu.:0.0410 1st Qu.:0.0143 1st Qu.:0.0000000 1st Qu.:0.0926
## Median :0.0626 Median :0.0797 Median :0.0000207 Median :0.1270
## Mean :0.1079 Mean :0.1772 Mean :0.0911294 Mean :0.1910
## 3rd Qu.:0.1330 3rd Qu.:0.2600 3rd Qu.:0.0065725 3rd Qu.:0.2490
## Max. :0.9180 Max. :0.9940 Max. :0.9940000 Max. :0.9960
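One way to put a number on this skew, and to see what re-scaling might do, is to compute the sample skewness before and after a transformation. The sketch below is an illustration under stated assumptions: the `skewness()` helper is defined here (it is not a base R function), the data are simulated rather than taken from the Spotify file, and the square root is only one of several possible transformations:

```r
# Moment-based sample skewness (helper defined for this sketch)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(42)

# Simulated right-skewed feature on [0, 1], a stand-in for a variable like speechiness
feature <- pmin(rexp(1000, rate = 20), 1)

skew_raw  <- skewness(feature)        # skewness on the original scale
skew_sqrt <- skewness(sqrt(feature))  # skewness after a square-root transform

# A mean well above the median is another sign of right skew
c(mean = mean(feature), median = median(feature))
c(raw = skew_raw, sqrt = skew_sqrt)
```

A skewness near zero indicates a roughly symmetric distribution, so comparing the two values shows how much a given transformation helps for a given variable.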
As part of the data analysis and modeling, I looked at correlation, skewness, outliers and value frequency measures. I sliced the data into low (bottom quartile) and high (top quartile) popularity scores to try to explain song popularity.
Correlation and clustered column plots helped me determine which song characteristics are more prevalent in the most popular songs.
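The quartile cut-offs for such a split can be computed from the data with quantile() instead of being typed in by hand. The sketch below shows the idea on a simulated popularity column. The data frame is hypothetical, and the inclusive comparisons (`<=`, `>=`) are a choice made for this sketch:

```r
library(dplyr)

set.seed(7)

# Simulated popularity scores between 0 and 100 (hypothetical data)
toy <- data.frame(track_popularity = sample(0:100, 500, replace = TRUE))

# 25th and 75th percentiles of popularity
cuts <- quantile(toy$track_popularity, probs = c(0.25, 0.75), na.rm = TRUE)

# Bottom and top quartiles, using the computed cut-offs
bottom_q <- filter(toy, track_popularity <= cuts[["25%"]])
top_q    <- filter(toy, track_popularity >= cuts[["75%"]])

cuts
nrow(bottom_q)
nrow(top_q)
```

Computing the cut-offs this way keeps the split consistent if rows are added or removed during cleaning.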
spotify %>%
group_by(playlist_genre) %>%
summarise(
mean_popul = mean(track_popularity, na.rm = TRUE),
mean_liven = mean(liveness, na.rm = TRUE),
mean_speech = mean(speechiness, na.rm = TRUE),
mean_instr = mean(instrumentalness, na.rm = TRUE),
mean_acoust = mean(acousticness, na.rm = TRUE),
mean_loud = mean(loudness, na.rm = TRUE),
mean_danc = mean(danceability, na.rm = TRUE),
mean_energy = mean(energy, na.rm = TRUE),
mean_valence = mean(valence, na.rm = TRUE),
mean_durat = mean(duration_ms, na.rm = TRUE),
mean_mode = mean(mode, na.rm = TRUE),
mean_tempo = mean(tempo, na.rm = TRUE),
mean_key = mean(key, na.rm = TRUE))%>%
arrange(desc(mean_popul))
## # A tibble: 6 x 14
## playlist_genre mean_popul mean_liven mean_speech mean_instr mean_acoust
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 pop 45.9 0.177 0.0742 0.0634 0.172
## 2 rap 41.8 0.191 0.197 0.0802 0.197
## 3 latin 41.4 0.182 0.100 0.0526 0.213
## 4 rock 39.7 0.205 0.0579 0.0664 0.147
## 5 r&b 35.9 0.176 0.116 0.0285 0.264
## 6 edm 30.7 0.214 0.0879 0.245 0.0769
## # ... with 8 more variables: mean_loud <dbl>, mean_danc <dbl>,
## # mean_energy <dbl>, mean_valence <dbl>, mean_durat <dbl>, mean_mode <dbl>,
## # mean_tempo <dbl>, mean_key <dbl>
# Focusing the analysis on top quartile of popularity (goal is to determine whether the most popular songs have common characteristics)
topqtpop <-
spotify %>%
filter(track_popularity > 58) %>%
arrange(desc(track_popularity))
str(topqtpop)
# Summarizing the characteristics of the top-quartile (most popular) songs
t1 <- topqtpop %>%
group_by(playlist_genre) %>%
summarise(
mean_popul1 = mean(track_popularity, na.rm = TRUE),
mean_liven1 = mean(liveness, na.rm = TRUE),
mean_speech1 = mean(speechiness, na.rm = TRUE),
mean_instr1 = mean(instrumentalness, na.rm = TRUE),
mean_acoust1 = mean(acousticness, na.rm = TRUE),
mean_loud1 = mean(loudness, na.rm = TRUE),
mean_danc1 = mean(danceability, na.rm = TRUE),
mean_energy1 = mean(energy, na.rm = TRUE),
mean_valence1 = mean(valence, na.rm = TRUE),
mean_durat1 = mean(duration_ms, na.rm = TRUE),
mean_mode1 = mean(mode, na.rm = TRUE),
mean_tempo1 = mean(tempo, na.rm = TRUE),
mean_key1 = mean(key, na.rm = TRUE))%>%
arrange(desc(mean_popul1))
# Sort t1 by playlist_genre
t1 <- arrange(t1, playlist_genre)
# Focusing the analysis on bottom quartile of popularity (goal is to determine whether the least popular songs have common characteristics)
botqtpop <-
spotify %>%
filter(track_popularity < 21) %>%
arrange(desc(track_popularity))
str(botqtpop)
# Summarizing the characteristics of the bottom-quartile (least popular) songs
b1 <- botqtpop %>%
group_by(playlist_genre) %>%
summarise(
mean_popul2 = mean(track_popularity, na.rm = TRUE),
mean_liven2 = mean(liveness, na.rm = TRUE),
mean_speech2 = mean(speechiness, na.rm = TRUE),
mean_instr2 = mean(instrumentalness, na.rm = TRUE),
mean_acoust2 = mean(acousticness, na.rm = TRUE),
mean_loud2 = mean(loudness, na.rm = TRUE),
mean_danc2 = mean(danceability, na.rm = TRUE),
mean_energy2 = mean(energy, na.rm = TRUE),
mean_valence2 = mean(valence, na.rm = TRUE),
mean_durat2 = mean(duration_ms, na.rm = TRUE),
mean_mode2 = mean(mode, na.rm = TRUE),
mean_tempo2 = mean(tempo, na.rm = TRUE),
mean_key2 = mean(key, na.rm = TRUE))%>%
arrange(desc(mean_popul2))
# Sort b1 by playlist_genre
b1 <- arrange(b1, playlist_genre)
# Calculating the percent difference between the most and least popular songs for all song characteristics
# (t1 and b1 are both sorted by playlist_genre, so their rows line up)
dif_db <- 100 * (t1[-1] - b1[-1]) / t1[-1]
# Adding the genre column back (it becomes the last column)
dif_db <- cbind(dif_db, b1[1])
# Renaming the columns
colnames(dif_db) <- c("popul", "liven", "speech", "instrum", "acoust", "loud", "dance", "energy", "valence", "durat", "mode", "tempo", "key", "genre")
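The percent difference used above is 100 * (top - bottom) / top, computed per genre and per characteristic. A small worked example with hypothetical group means (two genres, two characteristics, values made up for illustration) shows how the numbers come out:

```r
# Hypothetical group means for the top and bottom popularity quartiles
top <- data.frame(
  genre  = c("pop", "rock"),
  acoust = c(0.20, 0.15),
  instr  = c(0.02, 0.04),
  stringsAsFactors = FALSE
)
bottom <- data.frame(
  genre  = c("pop", "rock"),
  acoust = c(0.16, 0.12),
  instr  = c(0.05, 0.08),
  stringsAsFactors = FALSE
)

# The element-wise arithmetic only makes sense if the rows are in the same order
stopifnot(identical(top$genre, bottom$genre))

# Percent difference relative to the top-quartile mean
pct_diff <- 100 * (top[-1] - bottom[-1]) / top[-1]
pct_diff$genre <- top$genre

pct_diff
```

Two properties of this formula are worth keeping in mind when reading the chart: a top-quartile mean close to zero (as can happen with instrumentalness) inflates the percentage, and a negative denominator (loudness is measured in negative dB) flips its sign.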
Conclusion: Song popularity seems to be associated with a higher level of acousticness and a lower level of instrumentalness, as shown by the chart below.
library(tidyr)
library(ggplot2)
# Reshape to long format; dif_db[2:14] drops the popularity column and keeps the 12 characteristics plus genre
dat.g <- gather(dif_db[2:14], type, value, -genre)
ggplot(dat.g, aes(type, value)) +
geom_bar(aes(fill = genre), stat = "identity", position = "dodge") +
# Vertical lines separating the groups of bars
geom_vline(xintercept = seq(1.5, 13.5, by = 1)) +
theme_get() +
scale_x_discrete(name = "Musical Characteristics") +
scale_y_continuous(name = "%") +
ggtitle("Comparing Most (top quartile) vs. Least (bottom quartile) Popular Songs",
subtitle = "% difference between top and bottom quartiles of song popularity for 12 musical characteristics")
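The wide-to-long reshaping step in the chart code can also be written with pivot_longer(), which the tidyr documentation recommends over gather() for new code. The sketch below compares the two on a hypothetical two-genre table (the values are made up, and pivot_longer() assumes tidyr 1.0.0 or later):

```r
library(tidyr)

# Hypothetical percent differences for two characteristics and two genres
toy <- data.frame(
  acoust = c(20, 15),
  instr  = c(-150, -100),
  genre  = c("pop", "rock"),
  stringsAsFactors = FALSE
)

# Reshaping with gather(): one row per genre and characteristic
long_gather <- gather(toy, type, value, -genre)

# Equivalent reshaping with pivot_longer()
long_pivot <- pivot_longer(toy, cols = -genre, names_to = "type", values_to = "value")

long_gather
long_pivot
```

Both calls produce the same genre, type and value combinations. Only the row order differs, which does not affect the bar chart.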
Some variables are highly skewed, which could reduce the accuracy of predictive models for popularity. I will address this in Phase 2 of this project.
I plan to explore machine learning techniques such as linear regression, trees, cluster analysis and other model architectures to develop a predictive model for song popularity.